Sampling Cube: A Framework for Statistical OLAP over Sampling Data

Sampling is a popular method of data collection when it is impossible or too costly to reach the entire population. For example, television show ratings in the United States are gathered from a sample of roughly 5,000 households. To use the results effectively, the samples are further partitioned in a multidimensional space based on multiple attribute values. This naturally leads to the desirability of OLAP (Online Analytical Processing) over sampling data. However, unlike traditional data, sampling data is inherently uncertain, i.e., it does not represent the full population. Thus, it is desirable to return not only query results but also confidence intervals indicating the reliability of those results. Moreover, a given segment of the multidimensional space may contain no samples or too few. This requires additional analysis to return trustworthy results. In this paper, we propose a Sampling Cube framework, which efficiently calculates confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase the sample size when needed. Further, to handle high-dimensional data, a Sampling Cube Shell method is proposed to effectively reduce the storage requirement while still preserving query result quality.
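
As a rough illustration of the idea in the abstract, the sketch below answers a query over a handful of samples with a mean and a confidence interval, and falls back to a coarser segment when a cell has too few samples. It is not the paper's algorithm: the data, the names (`SAMPLES`, `select`, `mean_with_ci`, `answer`), the normal-approximation interval with z = 1.96, and the simple "drop one dimension" roll-up are all assumptions made for this example.

```python
import math
import statistics

# Hypothetical sample rows: (age_group, region, rating).
SAMPLES = [
    ("18-34", "east", 4.0),
    ("18-34", "east", 6.0),
    ("18-34", "west", 5.0),
    ("35-54", "east", 7.0),
    ("35-54", "west", 9.0),
]


def select(samples, age_group=None, region=None):
    """Return the ratings in one cell; None means 'all values' for that dimension."""
    return [
        rating
        for a, r, rating in samples
        if (age_group is None or a == age_group) and (region is None or r == region)
    ]


def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) using a normal approximation (assumed here)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a standard error")
    mean = statistics.fmean(values)
    margin = z * statistics.stdev(values) / math.sqrt(n)
    return mean, mean - margin, mean + margin


def answer(samples, age_group, region, min_samples=3):
    """Answer a cell query; roll up over region if the cell is too sparse."""
    values = select(samples, age_group, region)
    if len(values) < min_samples:
        # Simplified stand-in for grouping similar segments:
        # drop the region constraint to enlarge the sample.
        values = select(samples, age_group, None)
    return mean_with_ci(values)


if __name__ == "__main__":
    print(answer(SAMPLES, "18-34", "west"))
```

Here the cell ("18-34", "west") holds a single sample, so the query is answered from all three "18-34" samples instead. A real system would choose which segments to merge based on their similarity rather than always dropping a fixed dimension.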
Date: June 02, 2008
Book Title: ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'08)
Type: InProceedings
Edition: Proc 2008
Address: Vancouver, Canada

Bibtex


@InProceedings{Sampling_Cube_A_Framework_for_Statistica,
  author = "Xiaolei Li and Jiawei Han and Zhijun Yin and Jae-Gil Lee and Yizhou Sun",
  title = "{Sampling Cube: A Framework for Statistical OLAP over Sampling Data}",
  month = "June",
  year = "2008",
  edition = "Proc 2008",
  address = "Vancouver, Canada",
  booktitle = "ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'08)",
}