Trace replay

This page describes how to replay URL traces with Polygraph. DNS-related replay features are available starting with Polygraph version 3.0. The ${url_number} macro is supported starting with version 4.0.4.

Table of Contents

1. For the impatient
2. Introduction
3. Trace format
4. Client side
5. Server side
6. DNS
7. url_number macro
8. Example

1. For the impatient

Server S = {
    addresses = ...;
};

Robot R = {
    interests = [ "foreign" ];
    foreign_trace = "/tmp/trace.urls";
    ...
    origins = S.addresses;
    dns_resolver = ...;
};

AddrMap M = {
    zone = ".";
    addresses = S.addresses;
    names = tracedHosts(R.foreign_trace);
};

2. Introduction

Polygraph supports replaying URL traces: Polygraph robots load the entire trace into RAM and use traced URLs for some or all of the generated requests. Trace replay can be useful for testing URL and content filters as well as for introducing real origin servers into the mix. In general, both real and Polygraph servers can be used to replay a trace.

The sections below document the trace format and explain how to configure the client, server, and DNS sides of a test for trace replay.

3. Trace format

The simplest trace is a plain text file, with one HTTP URL per line. Polygraph can also accept many proxy "access logs" as traces.

http://www.example.com/
http://www.example.com:8080/
http://www.example.com/path${url_number}with/macro
https://www.example.com/path/index.html
ftp://172.16.0.1:80/path/index.htm

When parsing a trace, Polygraph ignores comments and empty lines. A comment starts with a "#" character and continues to the end of the line. To find a URL on a line, Polygraph looks for the first sequence of non-space characters starting with a protocol scheme: "http://", "https://", or "ftp://". Once the first URL is found, Polygraph ignores the rest of the line and continues to the next one. Since most access logs contain the request URI as the first URL in a log entry, Polygraph can handle access logs without knowing their exact format.

Polygraph ignores URIs with any scheme other than HTTP, HTTPS, and FTP because it cannot fetch them.

# one can use comments to describe traces:
# this trace came from http://www.example.com/
# the above URL will not be considered part of a trace

# URLs below will be used
http://www.example.com/
https://www.example.com/
ftp://www.example.com/

# URL below will be skipped because it has unsupported scheme
svn://www.example.com/path/index.html

# the anchor part below will be ignored as a comment
http://www.example.com/index.html#anchorToIgnore

# IP addresses and port numbers are fine
http://172.16.0.1:8080/path/index.htm

# there is no URL on the next line, from Polygraph's point of view
www.example.com/index.html

# only the example.org URL will be noticed and used:
12 example.com http://www.example.org/ 34513 http://example.net/

# the following three URLs will be distinct when requested
# the numbers increase as the trace wraps (see below for details)
http://www.example.com/${url_number}.html # number  8, 16, 24...
http://www.example.com/${url_number}.html # number  9, 17, 25...
http://www.example.com/${url_number}.html # number 10, 18, 26...
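
The parsing rules above can be approximated outside of Polygraph, for example to preview which URLs a log file would contribute before running a test. The Python sketch below is only an illustration of the rules as described on this page, not Polygraph's actual parser. The function name extract_url is hypothetical, and the sketch assumes that scheme matching is case-sensitive and that a URL must begin at the start of a whitespace-separated token.

```python
SCHEMES = ("http://", "https://", "ftp://")

def extract_url(line):
    """Return the first supported URL on a trace line, or None."""
    # A comment starts at the first "#" and runs to the end of the line,
    # so URL anchors are dropped together with real comments.
    text = line.split("#", 1)[0]
    # Scan whitespace-separated tokens; the first token that begins with
    # a supported scheme wins and the rest of the line is ignored.
    for token in text.split():
        if token.startswith(SCHEMES):
            return token
    return None
```

Applied to the sample trace above, this sketch returns None for the comment lines, for the svn:// line, and for the line without a scheme. It returns http://www.example.org/ for the access-log-like line.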

4. Client side

A Polygraph robot can be configured to request URLs from a trace using the combination of interests and foreign_trace options:

Robot R = {
    interests = [ "foreign" ]; // use traces
    foreign_trace = "/tmp/test.urls";
    ...
};

In most cases, it is a good idea to preserve some Polygraph-generated traffic in the test. Without such traffic, Polygraph may not be able to synchronize phases and may not even leave the first phase. You can preserve Polygraph-generated URLs by adding another Robot with public interest or by combining multiple interests:

Robot R = {
    interests = [ "public": 1%, "foreign" ]; // 99% from trace
    foreign_trace = "/tmp/test.urls";
    ...
};

To generate the n-th miss, the robot requests the n-th URL from the trace, modulo the trace length. Thus, by default, once all URLs have been visited, no true misses will be generated and the actual recurrence ratio will not match the configured one. If accurate recurrence ratio and working set size enforcement are important, either the test must stop before all trace URLs are used up or a ${url_number} macro must be used in each trace URL.
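
The wrap-around behavior can be illustrated with a few lines of Python. This is a sketch of the selection rule described above, not Polygraph code. The names miss_url and is_true_miss are hypothetical, misses are counted from 0 here for convenience, and the trace is assumed to contain no duplicate URLs and no ${url_number} macros.

```python
def miss_url(trace, n):
    """URL requested for the n-th miss (n counts from 0 in this sketch)."""
    return trace[n % len(trace)]

def is_true_miss(trace, n):
    """True while the n-th miss still uses a URL not requested before."""
    return n < len(trace)

trace = ["http://a.example/", "http://b.example/", "http://c.example/"]
first_pass = [miss_url(trace, n) for n in range(3)]
second_pass = [miss_url(trace, n) for n in range(3, 6)]
# After the trace wraps, the second pass repeats the first one, so the
# intended "misses" are requests for URLs that were already visited.
```

In this sketch, the number of true misses a test can produce equals the trace length, which is one way to decide how long a macro-free trace can drive a test.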

Just like generated URLs, trace URLs are revisited to generate hits if the recurrence ratio is positive. Also note that because traced URLs, unlike generated ones, are not created anew for each test, a cache may already store matching responses from previous tests. Flush the cache before each test if you want to avoid this "memory effect". If you need a feature to make each trace unique across tests, please ask the developers to add support for a ${test_id} macro.

A robot does not check that a traced URL belongs to one of the known origin servers (i.e., has its host listed in the origins field). This implies that Polygraph can be used to request traced URLs from both Polygraph origin servers and real (or, more precisely, "foreign" or "not listed in the test configuration") origin servers.

All traced URLs are "foreign" URLs. Polygraph robots will report the number of foreign URLs requested and the number of corresponding responses. For example, the console output below shows that 747 requests using foreign URLs were sent and all of them were responded to. Note that the total number of responses (754) is slightly higher, indicating that seven Polygraph-specific URLs were generated and requested as well.

000.54| i-dflt    754  28.40      1   0.00   0    1
000.54| foreign URLs: requested: 747 served: 747
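
When console logs are checked by a script, the foreign URL counters can be extracted with a regular expression. The Python sketch below assumes the line layout shown in the sample output above. Other Polygraph versions may print the line differently, and the function name foreign_counts is hypothetical.

```python
import re

# Matches the "foreign URLs" console line from the sample output above.
FOREIGN_RE = re.compile(
    r"foreign URLs:\s*requested:\s*(\d+)\s+served:\s*(\d+)")

def foreign_counts(line):
    """Return (requested, served) from a console line, or None."""
    match = FOREIGN_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A script could compare the two numbers to detect foreign requests that did not get a response.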

Another source of foreign URLs is responses that robots are configured to parse: any foreign URLs embedded in those responses are requested as well.

5. Server side

If the trace contains Polygraph server addresses, Polygraph servers will receive traced URLs. The --accept_foreign_msgs yes command-line option must then be used or the servers will refuse to serve any content and close the connection. If the option is set, the servers will respond, using the first content type configuration to generate the response.

% polysrv --accept_foreign_msgs yes --config ...

If the trace does not contain Polygraph server addresses, then no special server-side configuration is needed as far as trace replay is concerned. However, it is usually a good idea to still have at least some Polygraph-specific traffic reaching Polygraph servers (see client-side discussion above for details).

6. DNS

When the trace contains domain names (and not just host IP addresses), Polygraph robots and/or the proxy need to resolve those names. When the trace contains many real domain names, and the use of real resolvers is not desirable, one has to configure a root name server to resolve all host names in a trace. This can be done using the dns_cfg tool that comes with Web Polygraph.

The dns_cfg tool can convert a PGL configuration file that uses address maps into forward and reverse zone configuration files (and a BIND configuration file). It is easy to get unique host names from the trace into an address map using the tracedHosts() PGL function:

AddrMap Map = {
    zone = "."; // root zone
    addresses = ...; // usually origin server addresses
    names = tracedHosts("/tmp/test.urls");
};

If the trace contains a mixture of host names from different TLDs, you should use the root zone in the PGL address map, as illustrated above. More information about dns_cfg is available elsewhere.
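
To preview which names an address map might receive, the unique hosts of a trace can be listed with a short script. The Python sketch below is not the tracedHosts() implementation. It only mimics the "unique host names" idea on a list of already extracted URLs. The function name unique_hosts is hypothetical, and whether tracedHosts() keeps IP addresses or preserves trace order is not stated on this page. The sketch keeps both.

```python
from urllib.parse import urlsplit

def unique_hosts(urls):
    """Return hosts from the given URLs, without duplicates, in first-seen order."""
    seen = set()
    hosts = []
    for url in urls:
        # hostname drops the port and lowercases the host part
        host = urlsplit(url).hostname
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts
```

Comparing the length of the resulting list with the number of addresses in the map is a quick sanity check before running dns_cfg.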

7. url_number macro

A trace URL may contain one or more ${url_number} macros:

http://example.com/path_${url_number}/index${url_number}.html

A ${url_number} macro is replaced with a URL "number", printed in hexadecimal with some zero-padding:

http://example.com/path_000001A3/index000001A3.html

Before the trace wraps, the URL number is the position of the URL in the trace. After each wrap, the total number of trace URLs is added to each URL number. Thus, when the macro is used in each trace URL, Polygraph produces unique URLs even after the trace wraps. Note that when a URL is revisited, it gets the same URL number as before, so you get the expected recurrence ratio.

The ${url_number} macro can be used both in the host name part of the URL and in the path.
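
The numbering and substitution rules can be sketched in Python as follows. This is an illustration of the description above rather than Polygraph's own code. The names url_number and expand are hypothetical, and the 8-digit uppercase hex width is an assumption taken from the sample URL above.

```python
MACRO = "${url_number}"

def url_number(position, wraps, trace_len):
    """URL number: the trace position plus trace_len for every completed wrap."""
    return position + wraps * trace_len

def expand(url, number, width=8):
    """Replace every ${url_number} macro with a zero-padded hex number."""
    # width=8 and uppercase digits are assumed from the sample URL above
    return url.replace(MACRO, format(number, "0{}X".format(width)))
```

For example, expand("http://example.com/path_${url_number}/index${url_number}.html", 0x1A3) yields the expanded URL shown above. With a trace of 10 URLs, the URL at position 3 gets numbers 3, 13, 23, and so on as the trace wraps.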

8. Example

Test_trace.pg, a very simple but complete and functioning workload that can be used for replaying a trace, is available in Polygraph distributions starting with version 3.0. Just bring your own trace.