A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data
Recent approaches in classifying evolving data streams
are based on supervised learning algorithms, which can be
trained with labeled data only. Manual labeling of data
is both costly and time-consuming. Therefore, in a real
streaming environment, where huge volumes of data appear
at a high speed, labeled data may be very scarce. Thus,
only a limited amount of training data may be available for
building the classification models, leading to poorly trained
classifiers. We apply a novel technique to overcome this
problem by building a classification model from a training
set having both unlabeled and a small amount of labeled
instances. This model is built as micro-clusters using a semi-supervised
clustering technique, and classification is performed
with the κ-nearest neighbor algorithm. An ensemble of
these models is used to classify the unlabeled data. Empirical
evaluation on both synthetic data and real botnet traffic
reveals that our approach, using only a small amount of labeled
data for training, outperforms state-of-the-art stream
classification algorithms that use twenty times more labeled
data than our approach.
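
The pipeline described in the abstract (micro-clusters built from partially labeled data, κ-nearest neighbor classification over those micro-clusters, and an ensemble vote) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a simple seeded k-means, with centroids initialized at the labeled points, in place of the semi-supervised clustering step, whose details the abstract does not give. It also assumes one model per data chunk and a plain majority vote across models. The names build_model, classify, and ensemble_classify are hypothetical.

```python
import math
from collections import Counter


def build_model(points, labels, iters=10):
    """Summarize one partially labeled chunk as a list of micro-clusters.

    Assumption: seeded k-means (centroids start at the labeled points)
    stands in for the paper's semi-supervised clustering step.
    labels[i] is None for an unlabeled point.
    """
    centroids = [p for p, y in zip(points, labels) if y is not None]
    k, dim = len(centroids), len(points[0])
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(i)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)
                )
    model = []
    for c, members in enumerate(groups):
        counts = Counter(labels[i] for i in members if labels[i] is not None)
        if counts:  # clusters with no labeled member cannot vote
            model.append((centroids[c], counts))
    return model


def classify(model, x, k_nn=1):
    """Vote among the k_nn micro-clusters whose centroids are nearest to x."""
    nearest = sorted(model, key=lambda mc: math.dist(x, mc[0]))[:k_nn]
    votes = Counter()
    for _, counts in nearest:
        votes.update(counts)
    return votes.most_common(1)[0][0]


def ensemble_classify(models, x, k_nn=1):
    """Majority vote over the predictions of the individual models."""
    votes = Counter(classify(m, x, k_nn) for m in models)
    return votes.most_common(1)[0][0]


# Two toy chunks, each with only one labeled instance per class.
chunk1 = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels1 = ["a", None, None, None, "b", None, None, None]
chunk2 = [(1, 1), (1, 2), (2, 1), (2, 2), (11, 11), (11, 12), (12, 11), (12, 12)]
labels2 = [None, "a", None, None, None, None, "b", None]

models = [build_model(chunk1, labels1), build_model(chunk2, labels2)]
print(ensemble_classify(models, (0.5, 0.5)), ensemble_classify(models, (10.5, 10.5)))
```

Storing only a centroid and per-class label counts for each micro-cluster keeps a model small enough to retain several of them at once, which is what makes an ensemble over successive chunks of a stream practical in this sketch.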
Date: December 30, 2008
Type: Article
Series: Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008
BibTeX:
@Article{A_Practical_Approach_to_Classify_Evolvin,
author = "Mohammad Masud and Jing Gao and Latifur Khan and Jiawei Han and Xiaohu Li",
title = "{A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data}",
month = "December",
year = "2008",
series = "Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008",
}