Deriving workload parameters from access logs

This page describes a group of Polygraph tools that simplify matching test configuration to real proxy access logs. These tools are available starting with Polygraph version 3.0.

Table of Contents

1. For the impatient
2. Introduction
3. Extracting statistics
4. Extracting content
5. access-filter
6. access-order
7. access2pgl
8. access2cdb

1. For the impatient

% access-filter --profile server squid-access.log > filtered-access.log
% access-order filtered-access.log | sort ... > ordered-access.log
% access2pgl ordered-access.log > workload.pgd

% access-filter --profile content squid-access.log > content-access.log
% access2cdb --cdbs mycdbs/ content-access.log

2. Introduction

Origin server and proxy access logs contain information that can be used to build Web Polygraph workloads. Access logs can be fed to Polygraph robots "as-is" for trace replay or preprocessed to extract various statistics and content to configure Polygraph robots and servers. This page focuses on the latter approach.

3. Extracting statistics

Typical proxy access logs contain information about request timing, "busy" user periods, response sizes, response times, response status codes, etc. With some effort, that information can be extracted in a form suitable for writing Polygraph workloads. Note, however, that an access log alone is usually insufficient to build a complete workload because access logs lack information about HTTP connections, cachability, inter-object relationships, etc. It is sometimes possible and desirable to instrument a proxy to log more information, and the tools discussed here can be adapted to extract additional, custom statistics.

Proxies sometimes log details about transactions that Polygraph cannot accurately reproduce. For example, Polygraph does not yet support FTP transactions and many HTTP response codes. Also, access log entries are usually ordered by entry timestamp rather than request timestamp, which makes them awkward to use for accurate reproduction of request interarrival times. Finally, it is sometimes desirable to base a Polygraph workload on a subset of log entries (e.g., entries originating from "local" end-users). All these factors lead to a multi-step procedure for extracting workload parameters from a raw access log:

  1. Make sure the access log is in Squid access log format. Most proxies can be configured to use that popular format; alternatively, existing logs can be converted to it. Polygraph tools use few Squid-specific log fields, which simplifies the conversion. The conversion itself is not described here; Polygraph tools assume Squid access log format. (A sample entry is shown after this list.)

  2. Optionally, remove unwanted log entries using Polygraph's access-filter tool or your own custom filter.

  3. Re-order the access log using a combination of Polygraph's access-order tool and your favorite sort program. This step is required for accurate request interarrival information; if you do not plan to use custom request-interarrival distributions or session parameters, you can skip it.

  4. Extract statistics from ordered access logs using Polygraph's access2pgl tool.

  5. Use extracted statistics to write Polygraph workloads. This step is not described here.
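
For reference, a native Squid access log entry consists of space-separated fields: completion timestamp, elapsed (response) time in milliseconds, client address, result code/status, response size, request method, URL, user ident, hierarchy information, and content type. A made-up but representative entry:

1066036250.121 388 10.0.0.5 TCP_MISS/200 4528 GET http://www.example.com/index.html - DIRECT/192.0.2.80 text/html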

4. Extracting content

Access logs do not contain actual response content, of course. However, they do contain URLs that can often be used to download that content. Downloaded content can then be fed into Polygraph content databases, configuring simulated servers to use real content in their responses.

Building a content database from access logs requires solving two problems. First, one has to make sure that logged URLs are suitable for re-requesting. While no automated solution is 100% accurate, the "content" profile of the access-filter tool can eliminate many URLs that probably should not be fetched. Even with this filter, please understand that filtered access logs may still contain entries that should not be requested in some environments.

The second problem is splitting access log entries into groups based on logged or guessed content types. In most cases, you want to have separate content databases for "images", "html", "downloads", etc. because those common classes of objects have distinct properties and relationships. The access2cdb tool downloads and stores content in several content databases, based on content types.

5. access-filter

The access-filter tool reads access log entries and writes "good" entries to the standard output. The goodness criteria depend on the filter "profile", specified via the required --profile command line option. Many criteria are arbitrary and may be inappropriate for your purposes; with some Perl knowledge, it should be easy to modify the script to use different criteria.
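
For illustration, here is a hypothetical, heavily simplified filter in the same spirit; it is not the actual access-filter code, and its two goodness criteria (a 200 status and the GET method) are just examples. It reads Squid-format entries on the standard input, writes good entries to the standard output, and reports a summary on the standard error stream.

#!/usr/bin/perl -w
# toy-filter.pl: a hypothetical, simplified stand-in for access-filter.
# Reads Squid-format access log entries on stdin, prints "good" entries
# to stdout, and reports a summary on stderr.
use strict;

my ($good, $total) = (0, 0);
while (my $line = <STDIN>) {
    ++$total;
    my @fields = split(' ', $line);
    next unless @fields >= 7;              # skip malformed entries
    my ($codes, $method) = @fields[3, 5];  # e.g., "TCP_MISS/200" and "GET"
    my ($status) = $codes =~ m{/(\d+)};    # HTTP status part of the 4th field
    next unless defined $status && $status == 200;  # sample criterion 1
    next unless $method eq 'GET';                   # sample criterion 2
    print $line;    # good entries pass through unmodified
    ++$good;
}
warn "kept $good of $total entries\n";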

The right profile depends on what statistics or content you plan to extract from the filtered access log. Sometimes, different filtering rules should be used to collect different kinds of statistics. For example, the response status code matters much less when measuring request interarrival times than when measuring response sizes.

At the time of writing, the defined profiles include "server" (for filtering logs before statistics extraction) and "content" (for filtering logs before content downloads). Consult the documentation at the top of the access-filter script source code for each profile's currently enabled goodness tests.

In addition to good log entries, the access-filter tool prints statistics about its filtering choices. The format details are not documented, but the output is a collection of (count, percentage) histograms of status codes (SC), URI schemes or protocols (PRT), URI query terms (URI), request methods (MT), country codes (CC), log entries, client IP addresses (IP), and reasons for disabling a client IP address (Bads). These statistics, along with progress lines, are printed to the standard error stream:

% access-filter --profile server squid-access.log 1> filtered-access.log 2> filter.stats

The filter does not modify good log entries.

6. access-order

Squid logs an entry when the corresponding transaction has been completed. This means that entries are stored in response completion order rather than request acceptance order. For example, a request accepted at time 100 that took five seconds to complete is logged after a request accepted at time 102 that took one second. Fortunately, there is enough information in the log to reorder entries based on request acceptance time.

The access-order tool reads access log entries, modifies each entry so that its first field becomes the request acceptance time, and writes the modified entry to the standard output. A sort program can then be used to order the result:

% access-order filtered-access.log | sort -t' ' -n +0 > ordered-access.log

The exact sort command options may differ depending on your environment, but they should tell the command that input fields are separated by spaces and that entries need to be sorted numerically on the first field.
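
For example, on modern GNU systems, where the obsolete +0 key syntax is no longer accepted, the equivalent command would be:

% access-order filtered-access.log | sort -t' ' -k1,1 -n > ordered-access.log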

Sorting should not be needed if you do not plan to use any statistics related to request arrival time.

7. access2pgl

The access2pgl tool reads an access log file and prints statistics that can be used for configuring Polygraph. At the time of writing, the following distributions are computed, along with related parameters:

User sessions are computed on a per-IP basis. The access2pgl script measures the delay between sequential requests from the same IP address. If a delay is longer than one minute (configurable via the SessionIdleTout constant at the top of the script), the delay becomes an idle period and the session for the corresponding IP address ends. The script prints statistics about busy and idle periods, as well as the number of user requests per session, aggregated over all IP addresses.
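
For concreteness, here is a simplified Perl sketch of the session-splitting idea; it is not the actual access2pgl code and assumes the log has already been ordered by request acceptance time as described above.

#!/usr/bin/perl -w
# Hypothetical sketch in the spirit of access2pgl session accounting;
# not the actual script. Expects an ordered Squid-format log on stdin.
use strict;

my $SessionIdleTout = 60;   # seconds, like the constant in access2pgl
my (%lastTime, %reqCount, @idlePeriods, @reqsPerSession);

while (<STDIN>) {
    my ($time, $elapsed, $ip) = split(' ', $_);  # first three log fields
    next unless defined $ip;
    if (exists $lastTime{$ip}) {
        my $delay = $time - $lastTime{$ip};
        if ($delay > $SessionIdleTout) {   # long delay: the session ended
            push @idlePeriods, $delay;
            push @reqsPerSession, $reqCount{$ip};
            $reqCount{$ip} = 0;            # the new session starts empty
        }
    }
    $lastTime{$ip} = $time;
    ++$reqCount{$ip};
}
# account for sessions still open at the end of the log
push @reqsPerSession, grep { $_ > 0 } values %reqCount;

printf STDERR "sessions: %d, idle periods: %d\n",
    scalar @reqsPerSession, scalar @idlePeriods;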

Depending on the distribution, it is dumped either in PGL selector syntax (for cutting and pasting into PGL workloads) or in tabular distribution syntax (for storing externally and referencing from PGL files).

% access2pgl ordered-access.log > workload.pgd

An example of computed statistics is available elsewhere.

8. access2cdb

The access2cdb tool reads an access log file and downloads the referenced objects, stuffing them into Polygraph Content Database (.cdb) files based on reported or guessed content types. The user specifies the directory where the files should be created or updated. That directory must exist.
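
For example, assuming the databases should go into a fresh mycontent/ directory:

% mkdir mycontent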

For each access log entry, access2cdb determines the content type using a hard-coded table of content types and filename extensions. Filename extensions are consulted only when no content type information was logged. You can modify the @ContentGroups variable inside the script to change the mapping between content types and content databases, as well as to add new content types.
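
The actual structure of @ContentGroups may differ between versions, so check the script source first; the hypothetical fragment below merely illustrates the kind of mapping involved, from content types and filename extensions to content database names.

# Hypothetical illustration only; consult the real @ContentGroups
# definition in the access2cdb script before editing it.
my @ContentGroups = (
    { name => 'image',    types => [qw(image/gif image/jpeg image/png)],
                          exts  => [qw(gif jpg jpeg png)] },
    { name => 'html',     types => [qw(text/html)],
                          exts  => [qw(html htm)] },
    { name => 'download', types => [qw(application/octet-stream)],
                          exts  => [qw(exe zip gz)] },
);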

Once the content group is selected (based on content type), the access2cdb tool downloads the corresponding URL using the wget tool and adds the content to the corresponding content database in the user-specified directory (the --cdbs option) using the cdb tool distributed with Polygraph. This process may take a long time for long logs: fetching real content over the Internet is often slow, and the script fetches one URL at a time.

% access2cdb --cdbs mycontent/ content-access.log

Both the wget and cdb tools must be in your executable path. You can adjust the script source if you want to use a different download tool.

Upon completion, access2cdb prints simple statistics reflecting the popularity of content groups and of individual content types and extensions.