BlogVox: Separating Blog Wheat from Blog Chaff
Blog posts are often informally written, poorly structured, rife with
spelling and grammatical errors, and feature non-traditional
content. These characteristics make them difficult to process with
standard language analysis tools. Performing linguistic analysis on
blogs is plagued by two additional problems: (i) the presence of spam
blogs and spam comments and (ii) extraneous non-content including
blog-rolls, link-rolls, advertisements and sidebars. We describe
techniques for eliminating noisy blog data, developed as part of
BlogVox, a blog analytics engine we built for the 2006
TREC Blog Track. The findings in this paper underscore the importance
of removing spurious content from blog collections.
Date: January 07, 2007
Book Title: Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007)
Type: InProceedings
Bibtex
@InProceedings{BlogVox_Separating_Blog_Wheat_from_Blog_,
author = "Akshay Java and Pranam Kolari and Tim Finin and James Mayfield and Anupam Joshi and Justin Martineau",
title = "{BlogVox: Separating Blog Wheat from Blog Chaff}",
month = "January",
year = "2007",
booktitle = "Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007)",
}