Hierarchical Clustering of Webpages via Cross-Page and In-Page Link Structures

Despite the wide diversity of webpages, the webpages residing within a particular organization are, in most cases, organized in semantically hierarchical structures. For example, the website of a computer science department contains pages about its people, courses and research; the people pages are categorized into faculty, staff and students, and the research pages branch into different areas. Uncovering such hierarchical structures could give users a convenient way to navigate a site comprehensively and could accelerate other web mining tasks. In this study, we extract a similarity matrix among pages from in-page and cross-page link structures. Based on this matrix, we develop a density-based clustering algorithm that hierarchically groups densely linked webpages into semantic clusters. Our experiments show that this method is efficient and effective, and that it sheds light on mining and exploring web structures.
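The general idea in the abstract, a page-to-page similarity derived from links followed by hierarchical grouping of densely linked pages, can be illustrated with a small sketch. This is not the paper's algorithm: the Jaccard similarity over link neighborhoods, the threshold-based grouping, and all names (`link_similarity`, `clusters_at`, `hierarchy`) are assumptions chosen for illustration, and only cross-page links are used here.

```python
from collections import defaultdict


def link_similarity(links):
    """Jaccard similarity of closed link neighborhoods.

    Assumed measure for illustration only, not the one used in the paper.
    `links` is an iterable of (source_page, target_page) pairs.
    """
    neighbors = defaultdict(set)
    for src, dst in links:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    pages = sorted(neighbors)
    sim = {}
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            na = neighbors[a] | {a}
            nb = neighbors[b] | {b}
            sim[(a, b)] = len(na & nb) / len(na | nb)
    return pages, sim


def clusters_at(pages, sim, threshold):
    """Group pages connected by pairs whose similarity is >= threshold."""
    parent = {p: p for p in pages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for p in pages:
        groups[find(p)].append(p)
    return sorted(sorted(g) for g in groups.values())


def hierarchy(links, thresholds=(0.3, 0.6)):
    """One clustering per threshold, from coarsest to finest."""
    pages, sim = link_similarity(links)
    return [clusters_at(pages, sim, t) for t in sorted(thresholds)]


# Hypothetical site: two tightly linked groups joined by a single link.
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
levels = hierarchy(links, thresholds=(0.3, 0.6, 0.9))
```

Raising the threshold splits coarse clusters into finer ones, which gives a simple nested hierarchy. A density-based method as described in the abstract would use a different grouping criterion.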
Date: June 21, 2010
Book Title: Proc. 2010 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'10)
Type: InProceedings
Downloads: 379

Bibtex


@InProceedings{Hierarchical_Clustering_of_Webpages_via_,
  author = "Cindy Xide Lin and Yintao Yu and Jiawei Han and Bing Liu",
  title = "{Hierarchical Clustering of Webpages via Cross-Page and In-Page Link Structures}",
  month = "June",
  year = "2010",
  booktitle = "Proc. 2010 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'10)",
}