BlogVox: Separating Blog Wheat from Blog Chaff

Blog posts are often informally written, poorly structured, rife with spelling and grammatical errors, and feature non-traditional content. These characteristics make them difficult to process with standard language analysis tools. Linguistic analysis of blogs is hampered by two additional problems: (i) the presence of spam blogs and spam comments and (ii) extraneous non-content, including blog-rolls, link-rolls, advertisements and sidebars. We describe techniques for eliminating noisy blog data, developed as part of the BlogVox system, a blog analytics engine we built for the 2006 TREC Blog Track. The findings in this paper underscore the importance of removing spurious content from blog collections.
Date: January 07, 2007
Book Title: Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007)
Type: InProceedings
Google scholar: RLWcRRhNOBoJ
Google citations: 5
Downloads: 1349

Bibtex


@InProceedings{BlogVox_Separating_Blog_Wheat_from_Blog_,
  author = "Akshay Java and Pranam Kolari and Tim Finin and James Mayfield and Anupam Joshi and Justin Martineau",
  title = "{BlogVox: Separating Blog Wheat from Blog Chaff}",
  month = "January",
  year = "2007",
  booktitle = "Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, 20th International Joint Conference on Artificial Intelligence (IJCAI-2007)",
}