Trace replay
This page describes how to replay URL traces with Polygraph. Trace replay
functionality has been in Polygraph for a while, but DNS-related replay
features are available starting with Polygraph version 3.0.
Table of Contents
1. For the impatient
2. Introduction
3. Trace format
4. Client side
5. Server side
6. DNS
7. Example
1. For the impatient
Server S = {
addresses = ...;
};
Robot R = {
interests = [ "foreign" ];
foreign_trace = "/tmp/trace.urls";
...
origins = S.addresses;
dns_resolver = ...;
};
AddrMap M = {
zone = ".";
addresses = S.addresses;
names = tracedHosts(R.foreign_trace);
};
2. Introduction
Polygraph supports replaying of URL traces: Polygraph robots load the
entire trace into RAM and use traced URLs for some or all of the generated
requests. Trace replaying can be useful for testing URL and content
filters as well as for introducing real origin servers into the mix. In
general, both real and Polygraph servers can be used to replay a
trace.
The sections below document the trace format and explain how to configure
the client, server, and DNS sides of a test for trace replay.
3. Trace format
The simplest trace is a plain text file, with one HTTP URL per
line. Polygraph can also accept many proxy "access logs" as
traces.
http://www.example.com/
http://www.example.com:8080/
http://www.example.com/path/index.html
http://172.16.0.1:80/path/index.htm
When parsing a trace, Polygraph ignores comments and empty lines.
A comment starts with a "#" character and continues to the end of the
line. To find a URL on a line, Polygraph looks for the first sequence
of non-space characters starting with "http://". Once the first URL is
found, Polygraph continues with the next line. Since most access logs
contain request URI as the first URL in a log entry, Polygraph can
handle access logs without knowing their exact format.
Polygraph ignores non-HTTP URIs because it cannot fetch them.
# one can use comments to describe traces:
# this trace came from http://www.example.com/
# the above URL will not be considered part of a trace
# URL below will be used
http://www.example.com/
# URL below will be skipped because it does not use the http scheme
ftp://www.example.com/path/index.html
# the anchor part below will be ignored as a comment
http://www.example.com/index.html#anchorToIgnore
# IP addresses and port numbers are fine
http://172.16.0.1:8080/path/index.htm
# there is no URL on the next line, from Polygraph's point of view
www.example.com/index.html
# only the example.org URL will be noticed and used:
12 example.com http://www.example.org/ 34513 http://example.net/
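The parsing rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described on this page, not Polygraph's actual parser. The function name extract_trace_urls is hypothetical, and the sketch assumes that a URL is a whitespace-delimited token and that the "http://" match is case-sensitive.

```python
def extract_trace_urls(lines):
    """Return the URLs that a reader following the rules above would keep."""
    urls = []
    for line in lines:
        # A "#" starts a comment that runs to the end of the line,
        # so URL anchors are dropped along with ordinary comments.
        content = line.split("#", 1)[0]
        # Assumption: a URL is a whitespace-delimited token.
        for token in content.split():
            if token.startswith("http://"):
                urls.append(token)
                break  # only the first URL on a line is used
    return urls


sample_trace = """\
# this trace came from http://www.example.com/
http://www.example.com/
ftp://www.example.com/path/index.html
http://www.example.com/index.html#anchorToIgnore
http://172.16.0.1:8080/path/index.htm
www.example.com/index.html
12 example.com http://www.example.org/ 34513 http://example.net/
"""

for url in extract_trace_urls(sample_trace.splitlines()):
    print(url)
```

Running a sketch like this over a trace before a test is a cheap way to check which lines are likely to contribute URLs.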
4. Client side
A Polygraph robot can
be configured to request URLs from a trace using the combination of
the interests and foreign_trace options:
Robot R = {
interests = [ "foreign" ]; // use traces
foreign_trace = "/tmp/test.urls";
...
};
In most cases, it is a good idea to preserve some Polygraph-generated
traffic in the test. Without such traffic, Polygraph may not be able to
synchronize phases and may not even leave the first phase. You can
preserve Polygraph-generated URLs by adding another Robot with public
and/or private interest or by combining multiple interests:
Robot R = {
interests = [ "private": 1%, "foreign" ]; // 99% from trace
foreign_trace = "/tmp/test.urls";
...
};
To generate the i-th miss, the robot requests the i-th
URL from the trace, modulo trace length. Thus, once all URLs have been
visited, no true misses are generated and the actual recurrence ratio will
not match the configured one. If an accurate recurrence ratio is important,
the test must stop before all URLs are used up. URLs are revisited to
generate hits if the recurrence ratio is positive. Also note that since
traced URLs are not regenerated for each test, a cache may already store
matching responses from previous tests. Flush the cache before each test if
you want to avoid this "memory effect".
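The wrap-around behavior can be illustrated with a small Python sketch. It is not Polygraph code: the helper names are hypothetical, indexing is assumed to be 0-based, and the sample URLs are made up for the illustration.

```python
def url_for_miss(trace, i):
    """URL requested for the i-th intended miss (0-based), wrapping around."""
    return trace[i % len(trace)]


def true_miss_count(trace, intended_misses):
    """Count intended misses that use a URL not requested earlier in the run."""
    seen = set()
    fresh = 0
    for i in range(intended_misses):
        url = url_for_miss(trace, i)
        if url not in seen:
            seen.add(url)
            fresh += 1
    return fresh


# Hypothetical three-URL trace.
trace = [
    "http://www.example.com/a",
    "http://www.example.com/b",
    "http://www.example.com/c",
]

print([url_for_miss(trace, i) for i in range(5)])
print(true_miss_count(trace, 5))
```

With a three-URL trace, the fourth and fifth intended misses reuse the first two URLs, so only three of the five can be true misses. The same arithmetic gives a rough bound on how long a test can run before the trace is exhausted.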
A robot does not check that a traced URL belongs to one of the known
origin servers (i.e., has its host listed in the origins field). This
implies that Polygraph can be used to request traced URLs from both
Polygraph origin servers and real (or, more precisely, "foreign" or "not
listed in the test configuration") origin servers.
All traced URLs are "foreign" URLs. Polygraph robots report the
number of foreign URLs requested and the number of corresponding
responses. For example, the console output below shows that 747 requests
using foreign URLs were sent and all of them were responded to. Note that
the total number of responses (754) is slightly higher, indicating that
seven Polygraph-specific URLs were used as well.
000.54| i-dflt 754 28.40 1 0.00 0 1
000.54| foreign URLs: requested: 747 served: 747
Another source of foreign URLs is foreign URLs embedded in responses
that robots are configured to parse.
5. Server side
If the trace contains Polygraph server addresses, Polygraph servers
will receive traced URLs. The --accept_foreign_msgs yes
command-line option must then be used or the servers will refuse to serve
any content and close the connection. If the option is set, the servers
will respond, using the first content type configuration to generate the
response.
% polysrv --accept_foreign_msgs yes --config ...
If the trace does not contain Polygraph server addresses, then no
special server-side configuration is needed as far as trace replay is
concerned. However, it is usually a good idea to still have at least some
Polygraph-specific traffic reaching Polygraph servers (see the client-side
discussion above for details).
6. DNS
When the trace contains domain names (and not just host IP addresses),
Polygraph robots and/or the proxy need to resolve those names. When the
trace contains many real domain names, and the use of real resolvers is
not desirable, one has to configure a root name server to resolve all host
names in a trace. This can be done using the dns_cfg
tool that comes with Web Polygraph.
The dns_cfg
tool can convert a PGL configuration file that uses address maps into
forward and reverse zone configuration files (and a BIND configuration
file). It is easy to get unique host names from the trace into an address
map using the tracedHosts() PGL
function:
AddrMap Map = {
zone = "."; // root zone
addresses = ...; // usually origin server addresses
names = tracedHosts("/tmp/test.urls");
};
If the trace contains a mixture of host names from different TLDs, you
should use the root zone in the PGL address map, as illustrated above.
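To preview which names would need DNS entries, the set of unique host names in a trace can be approximated outside of Polygraph. The Python sketch below is not the implementation of tracedHosts(); the function names are hypothetical. Whether tracedHosts() itself skips IP addresses is not stated here, so the sketch skips them on the assumption that they need no DNS entries.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_literal(host):
    """Return True if host is an IPv4 or IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def unique_traced_hosts(urls):
    """Collect unique host names from URLs, in first-seen order."""
    hosts = {}
    for url in urls:
        host = urlsplit(url).hostname  # drops the port, lowercases the name
        # Assumption: IP addresses are skipped because they need no DNS entry.
        if host and not is_ip_literal(host):
            hosts.setdefault(host, None)
    return list(hosts)


urls = [
    "http://www.example.com/",
    "http://www.example.com:8080/",
    "http://172.16.0.1:80/path/index.htm",
    "http://www.example.org/path/index.html",
]
print(unique_traced_hosts(urls))
```

If the resulting names span several TLDs, that is a sign that the root zone is the right choice for the address map.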
More information about dns_cfg is available elsewhere.
7. Example
Test_trace.pg, a very
simple but complete and functioning workload that can be used for
replaying a trace, is available in Polygraph distributions starting with
version 3.0. Just bring your own trace.