Sampling Cube: A Framework for Statistical OLAP over Sampling Data
Sampling is a popular method of data collection when it is
impossible or too costly to reach the entire population. For
example, television show ratings in the United States are
gathered from a sample of roughly 5,000 households. To use
the results effectively, the samples are further partitioned in
a multidimensional space based on multiple attribute values.
This naturally leads to the desirability of OLAP (Online
Analytical Processing) over sampling data. However, unlike
traditional data, sampling data is inherently uncertain because
it does not represent the full population. Thus, it
is desirable to return not only query results but also the
confidence intervals indicating the reliability of the results.
Moreover, a given segment of the multidimensional space
may contain too few samples, or none at all. This requires
additional analysis to return trustworthy results.
In this paper, we propose a Sampling Cube framework,
which efficiently calculates confidence intervals for any multidimensional
query and uses the OLAP structure to group
similar segments to increase the sample size when needed.
Further, to handle high dimensional data, a Sampling Cube
Shell method is proposed to effectively reduce the storage
requirement while still preserving query result quality.
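As a rough illustration of the two ideas above (a confidence interval attached to every query answer, and grouping segments when a cell has too few samples), the following Python sketch may help. It is not the method from the paper: the names CellStats, aggregate, and query_mean are hypothetical, the interval uses a plain normal approximation, and "grouping similar segments" is simplified here to rolling up the last specified dimension until a minimum sample count (min_samples, also an assumption) is reached.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class CellStats:
    """Mergeable summary of the samples in one cell (hypothetical name)."""
    n: int = 0             # number of samples
    total: float = 0.0     # sum of sampled values
    total_sq: float = 0.0  # sum of squared sampled values

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "CellStats") -> "CellStats":
        # Counts and sums add up, so cells can be combined without raw data.
        return CellStats(self.n + other.n,
                         self.total + other.total,
                         self.total_sq + other.total_sq)

    def mean_ci(self, confidence: float = 0.95):
        """Return (mean, half_width); normal approximation (an assumption)."""
        if self.n < 2:
            raise ValueError("need at least two samples")
        mean = self.total / self.n
        var = max((self.total_sq - self.n * mean * mean) / (self.n - 1), 0.0)
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        return mean, z * sqrt(var / self.n)


def aggregate(cells, key):
    """Merge all base cells matching key; None in key means 'any value'."""
    out = CellStats()
    for cell_key, stats in cells.items():
        if all(k is None or k == c for k, c in zip(key, cell_key)):
            out = out.merge(stats)
    return out


def query_mean(cells, key, min_samples=30, confidence=0.95):
    """Answer a mean query, rolling up while the cell has too few samples."""
    key = tuple(key)
    while True:
        stats = aggregate(cells, key)
        if stats.n >= min_samples or all(k is None for k in key):
            mean, half_width = stats.mean_ci(confidence)
            return key, mean, half_width
        # Simplified expansion: generalize the last specified dimension.
        idx = max(i for i, k in enumerate(key) if k is not None)
        key = key[:idx] + (None,) + key[idx + 1:]


# Toy data: (age group, state) -> sampled values (made up for illustration).
cells = {}
for cell_key, values in {("18-25", "NY"): [1, 2, 3],
                         ("18-25", "CA"): [3, 4, 5]}.items():
    stats = CellStats()
    for v in values:
        stats.add(v)
    cells[cell_key] = stats

print(query_mean(cells, ("18-25", "NY"), min_samples=5))
```

In this toy run the ("18-25", "NY") cell holds only three samples, so the query is answered from the coarser ("18-25", any state) segment, and the returned key makes that substitution visible to the caller. Keeping only the count, sum, and sum of squares per cell is what lets the interval be computed for any aggregated segment without revisiting the raw samples.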
Date: June 02, 2008
Book Title: ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'08)
Type: InProceedings
Edition: Proc 2008
Address: Vancouver, Canada
BibTeX:
@InProceedings{Sampling_Cube_A_Framework_for_Statistica,
author = "Xiaolei Li and Jiawei Han and Zhijun Yin and Jae-Gil Lee and Yizhou Sun",
title = "{Sampling Cube: A Framework for Statistical OLAP over Sampling Data}",
month = "June",
year = "2008",
edition = "Proc 2008",
address = "Vancouver, Canada",
booktitle = "ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'08)",
}