
Clustering of scientific fields by integrating text mining and bibliometrics

by: Frizo Janssens
(22 May 2007)



Abstract

Increasing dissemination of scientific and technological publications via the Internet, and their availability in large-scale bibliographic databases, has led to tremendous opportunities to improve classification and bibliometric cartography of science and technology. This metascience benefits from the continuous rise of computing power and the development of new algorithms. Paramount challenges still remain, however. This dissertation verifies the hypothesis that accuracy of clustering and classification of scientific fields is enhanced by incorporation of algorithms and techniques from text mining and bibliometrics. Both textual and bibliometric approaches have advantages and intricacies, and both provide different views on the same interlinked corpus of scientific publications or patents. In addition to textual information in such documents, citations between them also constitute huge networks that yield additional information. We incorporate both points of view and show how to improve on existing text-based and bibliometric methods for the mapping of science.

The dissertation is organized into three parts: Firstly, we discuss the use of text mining techniques for information retrieval and for mapping of knowledge embedded in text. We introduce and demonstrate our text mining framework and the use of agglomerative hierarchical clustering. We also investigate the relationship between the number of Latent Semantic Indexing factors, the number of clusters, and clustering performance. Furthermore, we describe a combined semi-automatic strategy to determine the optimal number of clusters in a document set.

Secondly, we focus on analysis of large networks that emerge from many individual acts of authors citing other scientific works, or collaborating in the same research endeavor. These networks of science and technology can be analyzed with techniques from bibliometrics and graph theory in order to rank important and relevant entities, for clustering or partitioning, and for extraction of communities.

Thirdly, we substantiate the complementarity of text mining and bibliometric methods and we propose schemes for the sound integration of both worlds. The performance of unsupervised clustering and classification significantly improves by deeply merging textual content of scientific publications with the structure of citation graphs. Best results are obtained by a clustering method based on statistical meta-analysis, which significantly outperforms text-based and citation-based solutions.

Our hybrid strategies for information retrieval and clustering are corroborated by two case studies. The goal of the first is to unravel and visualize the concept structure of the field of library and information science, and to assess the added value of the hybrid approach. The second study is focused on bibliometric properties, cognitive structure and dynamics of the bioinformatics field. We develop a methodology for dynamic hybrid clustering of evolving bibliographic data sets by matching and tracking clusters through time.

To conclude, for the complementary text and graph worlds we devise a hybrid clustering approach that jointly considers both paradigms, and we demonstrate that with an integrated stance we obtain a better interpretation of the structure and evolution of scientific fields.

