Realistic content simulation

Table of Contents

1. Introduction
2. How realistic? See for yourself...
3. CSM configuration
    3.1 Content database
    3.2 PGL parameters
    3.3 Injecting generated content with text
4. Known problems

1. Introduction

One of the unique features of Web Polygraph is its ability to simulate realistic Web content. In this context, the word ``content'' stands for the actual bytes that make up the body of a Web object (as opposed to generic properties such as message size distribution or object popularity).

Realistic content simulation can be used to benchmark products and services that depend on or manipulate object contents. For example, content filtering/blocking proxies and advertisement insertion services should be tested using realistic content.

The Content Simulation Module (CSM) in Polygraph is configured using the PGL Content type. By default, Polygraph fills object bodies with semi-random bytes. This manual shows how to configure Polygraph to simulate realistic HTML content instead.

2. How realistic? See for yourself...

Realism is a subjective metric. We have set up a simple demo to show you some of Polygraph's capabilities. The demo is powered by a stock Polygraph server and the PGL configuration discussed below. Since polysrv is not really designed to drive demos like this, here are a few caveats you should know before proceeding.

If you understand and agree with the above terms, please try the URL (not linked to prevent crawlers from endlessly crawling polysrv output). Enjoy.

3. CSM configuration

Configuring CSM is a relatively straightforward process. First, you will need to prepare a database with HTML content that will be used to populate the model. Then you will use PGL to specify the model parameters.

3.1 Content database

To create a content database file, use the cdb program (compiled during "make all").

usage: src/csm/cdb <database.cdb> <command> [file.html ...]
        show  -  dump db contents to stdout
        add   -  absorb file(s) contents

As you can see, cdb can display database contents and add files to the database. If the database does not exist, the ``add'' operation will create it. You can add one or more files at a time. By default, the content to be added is read from the standard input. Alternatively, you can specify file names. cdb assumes that all input files are in an HTML-like format. It is your responsibility to strip off any HTTP headers, if needed.

At the time of writing, borders between input files are not important. The following two commands will produce the same results:

example> cat file1.html file2.html file3.html | cdb test.cdb add
example> cdb test.cdb add file1.html file2.html file3.html
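
If your input files are saved HTTP responses, the headers have to be removed before the content reaches cdb. The following is a minimal sketch of one way to do that, not part of Polygraph itself. The function name strip_http_headers and the file names in the usage comment are hypothetical, and the sketch assumes that every input file really does begin with a header block terminated by an empty line.

```shell
# strip_http_headers: print only the body of saved HTTP responses.
# For each input file (or stdin), every line up to and including the
# first empty line (the header terminator, with or without a trailing
# CR) is dropped. Do not run it on files that have no headers: their
# content up to the first empty line would be lost.
strip_http_headers() {
    awk 'FNR == 1 { body = 0 }          # new input file: back to header mode
         body { print; next }           # past the headers: copy the line
         /^\r?$/ { body = 1 }           # empty line ends the header block
        ' "$@"
}

# Hypothetical usage, relying on cdb reading stdin by default:
#   strip_http_headers resp1.raw resp2.raw | cdb test.cdb add
```

Because borders between input files do not matter, piping several stripped responses into a single ``add'' operation should be equivalent to adding them one by one.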

During a test, Polygraph will use HTML constructs from the database (and only from the database) to generate HTML pages.

A 1.2 MB (gzipped) content database used for the demo is available.

3.2 PGL parameters

To enable content simulation based on a .cdb database, simply add the content_db option to your Content specification. For example,

Content SimpleContent = {
    mime = { type = "text/html"; extensions = [ ".html" ]; };
    size = exp(11KB);
    cachable = 80%;

    content_db  = "pages.cdb"; // import content templates
};

The complete PGL configuration that drives the demo is available. To support a human-driven demo like that, tell the Polygraph server to ignore URLs by using the --ign_urls yes command line option of polysrv. Of course, this unusual option should not be used for real tests, where Polygraph robots rather than humans make the requests.

3.3 Injecting generated content with text

Many applications that analyze HTML content depend on the presence of well-known keywords. A common example is a content filtering proxy that would deny access to any page that contains the keyword ``sex''. If you can read this page, you are not using such a proxy.

Polygraph allows you to inject arbitrary text into generated HTML. The injections appear at random places between HTML tags, so that they do not disturb the HTML code. The following configuration instructs Polygraph to take injections from the "inj.tdb" file and infect 30% of the files. A file is considered ``infected'' if it receives at least one injection. The inject_gap field specifies the distance between two consecutive injections within one file.

Content PoisonedContent = {
    inject_db   = "inj.tdb";    // import text to inject
    infect_prob = 30%;          // portion of injected files
    inject_gap  = exp(100Byte); // average distance between injections
};

A .tdb file is simply a text file. You can use your favorite editor to maintain this database. New lines separate individual entries. Currently, there is no way to specify an entry that spans multiple lines. Please let us know if we should add such a feature. Entries can contain arbitrary text, including white space and HTML tags.
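
Since each entry must fit on one line, a multi-line snippet has to be flattened before it goes into the database. The sketch below shows one possible way to do that. The helper name tdb_entry and the file name snippet.html are hypothetical. Note that the helper changes white space: each line is trimmed, empty lines are dropped, and the remaining lines are joined with single spaces, which is usually harmless for HTML but is an assumption you should check against your own entries.

```shell
# tdb_entry: collapse a multi-line snippet read from stdin into a
# single line, so that it forms exactly one .tdb entry.
tdb_entry() {
    awk '{
             sub(/\r$/, "")                   # drop a trailing CR, if any
             gsub(/^[ \t]+|[ \t]+$/, "")      # trim leading/trailing blanks
             if ($0 != "")                    # skip empty lines
                 out = out (out == "" ? "" : " ") $0
         }
         END { print out }'
}

# Hypothetical usage: append one flattened entry to the database.
#   tdb_entry < snippet.html >> inj.tdb
```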

A small injection database used for the demo is available.

4. Known problems

Support for realistic content simulation is experimental. Here are some of the problems and bugs that may affect CSM operation at the time of writing.

We appreciate your feedback and requests for new features.