Most statistical graphics and statistical methods do not scale well beyond thousands or tens of thousands of observations, and large databases easily exceed these limits. Graphs for visualizing categorical data, i.e., counts represented by barcharts or mosaic plots, are an exception. Fortunately, the data in corporate databases are mostly categorical, which makes it possible to visualize even millions of records. Classical analysis software, however, cannot handle files of that size, so an analyst is tempted to dump only a subset of the data in order to use their analysis tool of choice. But choosing a subset a priori can be very cumbersome.
This paper shows how to work on large databases by using displays, selection tools, and techniques for categorical data. By combining two-level data access ("do not extract data from the database until the subset is small enough to handle") with hot set selections (as implemented in DataDesk), the analyst can work seamlessly on even very large databases within one tool.
A first implementation of this technique is presented with the research software MONDRIAN.
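The two-level data access described above can be illustrated with a minimal sketch. It is not MONDRIAN's implementation; it only assumes an SQL database (here an in-memory SQLite table), and the table `customers`, its columns, the threshold `MAX_ROWS`, and all function names are hypothetical choices made for illustration. At the first level the database itself computes the category counts needed for a barchart; at the second level the individual records are extracted only once the selected subset is small enough.

```python
import sqlite3

MAX_ROWS = 1000  # hypothetical threshold for "small enough to handle"


def category_counts(conn, table, column):
    """Level 1: the database aggregates; only the counts leave it."""
    sql = f'SELECT "{column}", COUNT(*) FROM "{table}" GROUP BY "{column}"'
    return dict(conn.execute(sql).fetchall())


def fetch_if_small(conn, table, column, value, max_rows=MAX_ROWS):
    """Level 2: extract rows only if the selected subset is small enough.

    Returns (subset_size, rows); rows is None when the subset is too large.
    """
    count_sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" = ?'
    (n,) = conn.execute(count_sql, (value,)).fetchone()
    if n > max_rows:
        return n, None  # keep working on counts inside the database
    rows_sql = f'SELECT * FROM "{table}" WHERE "{column}" = ?'
    return n, conn.execute(rows_sql, (value,)).fetchall()


def demo_connection():
    """Build a small illustrative table (made-up data, not from the paper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (region TEXT, segment TEXT)")
    rows = (
        [("north", "retail")] * 5000
        + [("south", "retail")] * 3000
        + [("west", "corporate")] * 12
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn
```

With the demo table, `category_counts(conn, "customers", "region")` yields the bar heights without moving any records, while `fetch_if_small(conn, "customers", "region", "west")` returns the 12 matching rows because that selection falls below the threshold. A selection on `"north"` returns only its size, since it is still too large to extract.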