NAME: msnbclength.dat (Internet Data Analysis for Undergrad Curriculum)
TYPE: Observational
SIZE: 50000 rows, one for each user.
DESCRIPTIVE ABSTRACT:
The data set gives a random sample of the lengths of visits by users entering the msnbc.com web site on September 28, 1999.
The length of a visit is an estimate of the total number of clicks or pages seen by each user. It is based on web server
logs, so it counts only pages recorded by the server. Pages cached in the user's browser or in a cache proxy server are
not counted. The data set used in the paper is much larger than the one made available here, but the larger data set is also
available on a page cited in the references.
SOURCE:
The data were extracted from the clickstream data set in the UCI KDD Archive which itself comes from Internet Information
Server (IIS) logs for msnbc.com and news-related portions of msn.com processed by Heckerman, 2003. The reader is welcome to
request from the authors the Perl program that converts the clickstream data into the length data described here.
VARIABLE DESCRIPTIONS:
Length: Numerical variable summarizing the length of the visit to the msnbc.com site. There are no missing values.
STORY BEHIND THE DATA:
Once a user enters a web site, how many pages or links within the site does that user visit? The answer to this question may
suggest actions to improve the site. If similar distributions for the number of pages visited per user are observed at
different web sites, then general laws might be established for all sites. Research efforts in this area are directed at
finding these laws. This is a small part of the current effort to understand human behavior on the web.
PEDAGOGICAL NOTES:
The length data set is useful for introducing students to the notion of a skewed distribution with thick tails, where rare
events are not so rare. This is a common feature of much Internet data, and it makes the probability distributions we
usually teach inappropriate for modeling such data. In a calculus-based Introductory Statistics class, or a
mathematical statistics class, the length data set gives students a chance to discover the inverse Gaussian distribution
and to make q-q plots of the data against the distributions suggested in the literature. Histograms, q-q plots,
and summary statistics should be computed for lengths less than 100, because some lengths in the data set are much
higher and obscure the behavior below 100. In a lower-division Introductory Statistics class, the data can be used to
illustrate with box plots that the points flagged as outliers in this skewed distribution are too numerous to be mere
outliers, and to introduce the notion of thick-tailed distributions. All the standard descriptive data analyses can also be
done, as can sampling to illustrate the Central Limit Theorem.
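The analyses suggested in these notes can be sketched in Python using only the standard library. This is an illustrative
sketch, not the authors' code: the visit lengths are simulated from a lognormal distribution as a stand-in for the real
file (whose layout is not described here), and the function names iqr_outliers, fit_inverse_gaussian, and sample_means are
hypothetical. The inverse Gaussian fit uses the standard maximum likelihood estimates, the sample mean for mu and
n / sum(1/x - 1/mean) for lambda.

```python
import math
import random
import statistics


def iqr_outliers(values):
    """Return the values outside the box-plot fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def fit_inverse_gaussian(values):
    """Maximum likelihood estimates (mu, lam) of an inverse Gaussian for positive data."""
    mu = statistics.fmean(values)
    lam = len(values) / sum(1.0 / v - 1.0 / mu for v in values)
    return mu, lam


def sample_means(values, n, reps, rng):
    """Means of `reps` random samples of size n, drawn with replacement."""
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(reps)]


if __name__ == "__main__":
    rng = random.Random(1999)
    # ASSUMPTION: simulated stand-in data, NOT the real msnbclength.dat values.
    lengths = [max(1, round(rng.lognormvariate(1.0, 1.2))) for _ in range(50000)]

    # Restrict to lengths below 100, as the notes recommend.
    below_100 = [v for v in lengths if v < 100]
    print("median:", statistics.median(below_100))
    print("mean:", statistics.fmean(below_100))
    print("flagged by the 1.5*IQR rule:", len(iqr_outliers(below_100)))
    print("inverse Gaussian MLE (mu, lambda):", fit_inverse_gaussian(below_100))

    # Central Limit Theorem: the spread of the sample means should be
    # close to sigma / sqrt(n) for samples of size n.
    means = sample_means(lengths, n=50, reps=1000, rng=rng)
    print("sd of sample means:", statistics.stdev(means))
    print("sigma / sqrt(n):", statistics.pstdev(lengths) / math.sqrt(50))
```

With the real data, the simulated list would be replaced by the values read from the file. A histogram of the sample means
can then be compared with a histogram of the raw lengths to show how averaging reduces the skewness.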
REFERENCES:
http://www.stat.ucla.edu/~jsanchez/oid03/csstats/index.htm (this site contains the large data set used in the paper,
msnbclength.txt).
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html
SUBMITTED BY:
Juana Sanchez
UCLA Department of Statistics
8125 Math Sciences Building
Box 951554
Los Angeles, CA 90095-1554
jsanchez@stat.ucla.edu