Realistic content simulation

1. Introduction
2. How realistic? See for yourself...
3. CSM configuration
    3.1 Content database
    3.2 PGL parameters
    3.3 Injecting generated content with text
4. Known problems

1. Introduction

One of the unique features of Web Polygraph is its ability to simulate realistic Web content. In this context, the word ``content'' stands for the actual bytes that make up the body of a Web object (as opposed to generic properties such as message size distribution or object popularity).

Realistic content simulation can be used for benchmarking products or services that depend on or manipulate object content. For example, content filtering/blocking proxies and advertisement insertion services should be tested using realistic content.

The Content Simulation Module (CSM) in Polygraph is configured using the PGL Content type. Usually, Polygraph fills object bodies with semi-random bytes. This manual shows you how to configure Polygraph to simulate realistic HTML content instead.

2. How realistic? See for yourself...

Realism is a subjective metric. We have set up a simple demo to show you some of Polygraph's capabilities. The demo is powered by a stock Polygraph server and the PGL configuration discussed below. Since polysrv is not really designed to drive demos like this, here are a few caveats you should know about before proceeding.

Simulated content is produced using parts of random real Web sites, without discrimination. If you might be bothered by the content of those sites, you are hereby explicitly prohibited from using the demo. We will accept absolutely no complaints regarding the content being generated by Polygraph.

View at least ten generated pages before making your judgment; simulation is a random process, after all.

Press the ``Reload'' button in your browser to get a new page. Following a link may also work.

Generated content was infected with these phrases to demonstrate the text injection capabilities discussed below.

Most images will appear ``broken'' as Polygraph cannot simulate realistic image content yet.

JavaScript and other embedded ``code'' on the pages may steer you away from the demo page, open new windows, etc.

View at your own risk. The content is real and may cause undesired side-effects.

Browsers or Polygraph may be confused by some of the messages. Reload a few times if you get ``document contained no data'' or similar errors.

If the demo is down, please submit a bug report. We may not reply to your message, but will try to fix the problem.

If you understand and agree with the above terms, please try the http://www.web-polygraph.org:8181 URL (not linked to prevent crawlers from endlessly crawling polysrv output). Enjoy.

3. CSM configuration

Configuring CSM is a relatively straightforward process. First, you will need to prepare a database with HTML content that will be used to populate the model. Then you will use PGL to specify the model parameters.

3.1 Content database

To create a content database file, use the cdb program (compiled during "make all").
usage: src/csm/cdb <database.cdb> <command> [file.html ...]
commands: 
        show  -  dump db contents to stdout
        add   -  absorb file(s) contents
As you can see, cdb can display database contents and add files to the database. If the database does not exist, the ``add'' operation will create it. You can add one or more files at a time. By default, the content to be added is read from the standard input. Alternatively, you can specify file names. Cdb assumes that all input files are in an HTML-like format. It is your responsibility to strip off any HTTP headers, if needed.
At the time of writing, boundaries between input files are not important. The following two commands will produce the same results:
example> cat file1.html file2.html file3.html | cdb test.cdb add
example> cdb test.cdb add file1.html file2.html file3.html
During a test, Polygraph will use HTML constructs from the database (and only from the database) to generate HTML pages.
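
Since cdb expects bare HTML, saved HTTP responses need their headers removed before they are added. The following Python sketch shows one way to do that. It is not part of Polygraph, and the function name strip_http_headers is made up for this example. It assumes that a saved response starts with an HTTP status line and that the body is neither chunked nor compressed.

```python
def strip_http_headers(raw: bytes) -> bytes:
    """Return the body of a saved HTTP response.

    Input that does not start with an HTTP status line is assumed to be
    bare HTML already and is returned unchanged.
    """
    if not raw.startswith(b"HTTP/"):
        return raw

    # The header section ends at the first empty line. Accept both the
    # standard CRLF form and the bare LF form, whichever comes first.
    best = None
    for sep in (b"\r\n\r\n", b"\n\n"):
        idx = raw.find(sep)
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, len(sep))

    if best is None:
        return b""  # headers only, nothing to add to the database

    return raw[best[0] + best[1]:]
```

The returned bytes can then be written to the standard input of cdb, which reads the content to be added from there when no file names are given.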

A 1.2 MB (gzipped) content database used for the demo is available.

3.2 PGL parameters

To enable content simulation based on a .cdb database, simply add the content_db option to your Content specification. For example,
Content SimpleContent = {
    mime = { type = "text/html"; extensions = [ ".html" ]; };
    size = exp(11KB);
    cachable = 80%;

    content_db  = "pages.cdb"; // import content templates
    ...
};
The complete PGL configuration that drives the demo is available. To support a human-driven demo like that, you need to tell the Polygraph server to ignore URLs by using the --ign_urls yes command-line option of polysrv. Of course, this unusual option should not be used for real tests, where Polygraph robots rather than humans make the requests.

3.3 Injecting generated content with text

Many applications that analyze HTML content depend on the presence of well-known keywords. A common example is a content filtering proxy that would deny access to any page that contains the keyword ``sex''. If you can read this page, you are not using such a proxy.

Polygraph allows you to inject arbitrary text into generated HTML. The injections appear at random places between HTML tags, so as not to disturb the HTML code. The following configuration instructs Polygraph to take injections from the "inj.tdb" file and infect 30% of the files. A file is considered ``infected'' if it receives at least one injection. The inject_gap field specifies the distance between two consecutive injections within one file.
Content PoisonedContent = {
    ...
    inject_db   = "inj.tdb";    // import text to inject
    infect_prob = 30%;          // portion of injected files
    inject_gap  = exp(100Byte); // average distance between injections
};
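To get a feel for the two numeric parameters above, the following Python sketch draws values the way the configuration describes them: a 30% chance that a page is infected, and exponentially distributed gaps with a 100-byte mean (the mean is taken from the "average distance" comment in the configuration). This only illustrates the distributions. It is not Polygraph code, and the names page_is_infected and next_gap are hypothetical.

```python
import random

INFECT_PROB = 0.30   # mirrors infect_prob = 30%
MEAN_GAP = 100.0     # mirrors inject_gap = exp(100Byte), in bytes


def page_is_infected(rng: random.Random) -> bool:
    """Decide whether a page should receive injections."""
    return rng.random() < INFECT_PROB


def next_gap(rng: random.Random) -> float:
    """Draw a requested distance to the next injection, in bytes."""
    # expovariate takes the rate, which is 1 / mean
    return rng.expovariate(1.0 / MEAN_GAP)
```

Individual gaps vary widely around the mean, so two injections may be requested very close together or several hundred bytes apart.
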
A .tdb file is simply a text file. You can use your favorite editor to maintain this database. New lines separate individual entries. Currently, there is no way to specify an entry that spans multiple lines. Please let us know if we should add such a feature. Entries can contain arbitrary text, including white space and HTML tags.
A small injection database used for the demo is available.
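
If the phrases come from a script rather than an editor, the one-entry-per-line rule has to be enforced by hand. The Python sketch below is one possible way to do that; the function name to_tdb is made up for this example. It folds line breaks inside an entry into single spaces and drops empty entries. How Polygraph treats blank lines in a .tdb file is not stated here, so skipping them is an assumption.

```python
def to_tdb(entries):
    """Serialize phrases as .tdb text, one entry per line."""
    lines = []
    for entry in entries:
        # An entry cannot span lines, so join its pieces with spaces.
        pieces = [p.strip() for p in entry.splitlines()]
        flat = " ".join(p for p in pieces if p)
        if flat:  # assumption: empty entries are not useful
            lines.append(flat)
    if not lines:
        return ""
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file such as "inj.tdb" gives a database that can be named in the inject_db field.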

4. Known problems

Support for realistic content simulation is experimental. Here are some of the problems and bugs that may affect CSM operation at the time of writing.

Generated HTML may not be 100% valid at the beginning and at the end of a page. To be precise, opening HTML tags like <table> may not have their closing counterparts (</table>) and vice versa.

HTML comments may not be handled correctly by cdb.

Cdb rewrites all absolute URLs in input files into relative URLs. This can be considered a feature that helps with browsing the generated pages.

Browsers cannot display images embedded into the generated HTML.

Injected text will not disturb HTML code. Moreover, it will not change the size of the text ``paragraphs'' being modified. Thus, if a string to be injected is quite long, CSM has to skip a lot of paragraphs to find an area large enough to hold it.

A side effect of the above scheme is that the inject_gap parameter cannot be honored in all cases. The actual gaps are likely to be longer.
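
The effect can be illustrated with a toy model in Python. This is not the CSM algorithm, only a sketch of the constraint described above: an injection goes into the first text paragraph that is both far enough from the previous injection and large enough to hold the phrase. The function name place_injections, the (offset, length) paragraph representation, and the fixed gap (instead of a random one) are all assumptions made for the example.

```python
def place_injections(paragraphs, phrase_len, gap):
    """Return the offsets at which a phrase would be written.

    paragraphs: (offset, length) pairs for text runs, in page order.
    phrase_len: size of the phrase to inject, in bytes.
    gap:        requested distance between injections, in bytes.
    """
    placed = []
    next_target = gap  # earliest offset wanted for the next injection
    for offset, length in paragraphs:
        if offset < next_target:
            continue  # too close to the previous injection
        if length < phrase_len:
            continue  # too small: skipping it stretches the actual gap
        placed.append(offset)
        next_target = offset + phrase_len + gap
    return placed


page = [(100, 5), (220, 40), (300, 8), (330, 10), (500, 60)]
short = place_injections(page, phrase_len=5, gap=100)
long_ = place_injections(page, phrase_len=20, gap=100)
```

In this model the 5-byte phrase lands in four paragraphs, while the 20-byte phrase fits in only two of them, so the distances between its injections are well above the requested 100 bytes.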

Injected text will not overwrite prior injections. This limitation can be considered a feature.

Polygraph may add random text at the end of the page if the next template in the database is too large to fit at the end of the page (note that object sizes are determined prior to generating content using user-specified distributions, just as before).

Content may not be 100% reproducible. The current algorithm may behave differently depending on the number of I/O buffers available at the time of content generation. This is a bug. Only pages larger than ~15KB in size can be affected.

Simulating realistic HTML and injecting text slows Polygraph down by at least 50% on simple workloads.

We appreciate your feedback and requests for new features.
