Abstract:
|
It is fairly easy to instrument a Web site to collect an extensive warehouse of click stream data. These datasets literally contain every visitor click, page view, cache hit, ad impression, and referral, and may even contain transaction information. The data quality problem is that this rich and valuable data source is heavily contaminated with uninteresting machine-generated traffic from robots, spiders, and other Web bots, and is cluttered with errors. On smaller sites, machine-generated traffic may account for as much as 50% of the total. To exploit this data source for understanding visitor behavior, we must overcome three significant analysis problems. First, the huge volume of data easily overwhelms conventional analysis tools. Second, the data must be cleaned and transformed to avoid making improper inferences based on machine-generated traffic and log errors. And third, effective analysis that creates value involves correlating visitor behavior with other factors that can be used to influence it.
|