Re: polygraph workload

From: Alex Rousskov (rousskov@measurement-factory.com)
Date: Sun May 26 2002 - 11:01:18 MDT


On Fri, 24 May 2002, mukesh agrawal wrote:

> I have a question about a characteristic of the polymix-4
> workload. The fraction of the bytes requested that are due to
> large files seems low compared to previously published
> measurements. I placed a graph comparing the polygraph generated
> workload to the earlier measurements at
>
> http://www-2.cs.cmu.edu/~mukesh/loadfrac.ps.gz
>
> The data sets in the graph are:
>
> Calgary
> One year's requests for the CS department web server at
> University of Calgary
> From Arlitt and Williamson -- Sigmetrics 96
> Clarknet
> Two weeks' requests for a Wash, DC ISP's web server
> From Arlitt and Williamson -- Sigmetrics 96
> WorldCup (busy)
> A couple of hours' worth of requests for the busiest day of the
> WorldCup '98 website
> WorldCup (last day)
> The entire day's requests for the last day of WorldCup '98.
> Berkeley HomeIP
> The four hour trace from Berkeley's HomeIP study.
> Polygraph
> The workload from a Polygraph run.
>
> The only trace that has a smaller fraction of the bytes due to
> files >100K is the busy trace from the WorldCup site. Even the
> HomeIP trace, which I would expect to be skewed towards small
> files (as the users are connected via modems) has a larger
> fraction of load from large files (files >100K comprise ~23% of
> the load in HomeIP, versus 10% for polygraph run).
>
> So my question is: is the stock polymix-4 workload intended to
> accurately model the fraction of the load due to large objects?

First of all, thank you for doing this comparison and sharing the
results!

We want PolyMix workloads to accurately model a typical [corporate]
proxy environment. As your graph illustrates, most actual environments
differ among themselves. For any given trace, I can find traces that
show up above or below on your graph.

> so, is the distribution I'm seeing consistent with what is
> intended?

Difficult to say. First of all, most (all but one?) of your traces are
for origin servers, not corporate proxies. They also seem to differ in
length, request rate, and hit ratio. How does that affect
your study? I assume that you analyzed a Polygraph trace from the top2
phase of the PolyMix-4 run (other phases, especially the fill phase,
should not be used as they do not represent a steady-state case).

The best thing to do would be to (a) accurately compare PolyMix-4
results with whatever proxy traces you have access to, (b) analyse the
causes of the differences, and (c) suggest how PolyMix parameters
should be changed for Polygraph traces to become "better". For
example, should we increase the size of large files, increase the
portion of large files, or change BHR discrimination, etc.? Moreover,
how does changing one parameter affect the overall shape of the
distribution?
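
For step (a), the quantity being compared can be computed the same way
for every trace. The following is a minimal Python sketch, not part of
Polygraph: the function names are hypothetical, the input is assumed to
be a plain list of response sizes in bytes already extracted from a
log, and "100K" is assumed to mean 100 * 1024 bytes.

```python
from itertools import accumulate


def large_object_byte_fraction(sizes, threshold=100 * 1024):
    """Fraction of all transferred bytes that comes from responses
    larger than `threshold` bytes (threshold value is an assumption)."""
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(s for s in sizes if s > threshold) / total


def load_fraction_curve(sizes):
    """Points (size, fraction of bytes from responses <= size), one point
    per response, ordered by size. Repeated sizes yield repeated x values;
    the last point for a given size carries the cumulative fraction."""
    ordered = sorted(sizes)
    total = sum(ordered)
    if total == 0:
        return []
    return [(s, running / total)
            for s, running in zip(ordered, accumulate(ordered))]


if __name__ == "__main__":
    # Toy input: nine 10,000-byte responses and one 200,000-byte response.
    sample = [10_000] * 9 + [200_000]
    print(large_object_byte_fraction(sample))
    print(load_fraction_curve(sample)[-1])
```

Running the same two functions over each proxy trace and over the
steady-state portion of a Polygraph log would put all data sets on a
common footing before looking for causes of the differences.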

We would be more than happy to use your results and recommendations in
future Polygraph workloads!

I would also be interested to see a comparison of PolyMix-4
distributions with other benchmarks that have "standard" workloads
(SPEC, WebBench, Surge?). Do all benchmarks underestimate the portion
of large files? Do some of them get it right? Such a comparison would
be difficult because those standard workloads are often designed to
represent different environments.

One major improvement to PolyMix would be to have different
classes of origin servers (handling different portions of requests).
Another useful study would be to compare real origin servers and
classify them based on, say, file size distribution (which could also
depend on server popularity).

Please keep us posted on your findings and make specific
recommendations on how the workloads should be adjusted.

Thank you,

Alex.


