CETR: Content Extraction via Tag Ratios

We present Content Extraction via Tag Ratios (CETR), a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
Date: April 26, 2010
Book Title: Proc. of the 2010 World Wide Web Conference (WWW 2010)
Type: InProceedings
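
The line-by-line tag ratio computation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the ratio, so this example assumes it is the number of non-tag characters on a line divided by the number of tags on that line, with a line containing no tags counted as having one. The function name `tag_ratios` and the regex-based tag matching are illustrative choices, not the authors' implementation, and the two-dimensional model and clustering step are not shown.

```python
import re

# Simplified tag matcher (assumption): anything between '<' and '>' on one line.
# A real HTML parser would also handle comments, scripts and multi-line tags.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> list[float]:
    """Return one ratio per line: non-tag characters / number of tags.

    Assumed definition: a line with zero tags is treated as having one tag,
    so its ratio equals its text length.
    """
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        # Strip the tags, then surrounding whitespace, and count what is left.
        text_chars = len(TAG_RE.sub("", line).strip())
        ratios.append(text_chars / max(tag_count, 1))
    return ratios


if __name__ == "__main__":
    sample = (
        "<div><a href='x'>Home</a></div>\n"
        "<p>This is a long paragraph of article text.</p>\n"
        "plain text line"
    )
    for line, ratio in zip(sample.splitlines(), tag_ratios(sample)):
        print(f"{ratio:6.2f}  {line}")
```

Under this assumed definition, navigation-style lines with many tags and little text score low, while paragraph-like lines score high, which is the signal the clustering step would then separate into content and non-content.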
