Clustering of scientific fields by integrating text mining and bibliometrics
by: Frizo Janssens
(22 May 2007)
Abstract

Increasing dissemination of scientific and technological publications via the Internet, and their availability in large-scale bibliographic databases, has led to tremendous opportunities to improve classification and bibliometric cartography of science and technology. This metascience benefits from the continuous rise of computing power and the development of new algorithms. Paramount challenges still remain, however. This dissertation verifies the hypothesis that accuracy of clustering and classification of scientific fields is enhanced by incorporation of algorithms and techniques from text mining and bibliometrics. Both textual and bibliometric approaches have advantages and intricacies, and both provide different views on the same interlinked corpus of scientific publications or patents. In addition to textual information in such documents, citations between them also constitute huge networks that yield additional information. We incorporate both points of view and show how to improve on existing text-based and bibliometric methods for the mapping of science.

The dissertation is organized into three parts:

Firstly, we discuss the use of text mining techniques for information retrieval and for mapping of knowledge embedded in text. We introduce and demonstrate our text mining framework and the use of agglomerative hierarchical clustering. We also investigate the relationship between the number of Latent Semantic Indexing factors, the number of clusters, and clustering performance. Furthermore, we describe a combined semi-automatic strategy to determine the optimal number of clusters in a document set.

Secondly, we focus on analysis of large networks that emerge from many individual acts of authors citing other scientific works, or collaborating in the same research endeavor. These networks of science and technology can be analyzed with techniques from bibliometrics and graph theory in order to rank important and relevant entities, for clustering or partitioning, and for extraction of communities.

Thirdly, we substantiate the complementarity of text mining and bibliometric methods and we propose schemes for the sound integration of both worlds. The performance of unsupervised clustering and classification significantly improves by deeply merging textual content of scientific publications with the structure of citation graphs. Best results are obtained by a clustering method based on statistical meta-analysis, which significantly outperforms text-based and citation-based solutions.

Our hybrid strategies for information retrieval and clustering are corroborated by two case studies. The goal of the first is to unravel and visualize the concept structure of the field of library and information science, and to assess the added value of the hybrid approach. The second study is focused on bibliometric properties, cognitive structure and dynamics of the bioinformatics field. We develop a methodology for dynamic hybrid clustering of evolving bibliographic data sets by matching and tracking clusters through time.

To conclude, for the complementary text and graph worlds we devise a hybrid clustering approach that jointly considers both paradigms, and we demonstrate that with an integrated stance we obtain a better interpretation of the structure and evolution of scientific fields.
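
The idea of clustering documents on text and citations jointly can be illustrated with a small sketch. It is not the dissertation's method: the abstract reports that the best results come from a scheme based on statistical meta-analysis, whereas the code below swaps in a much simpler weighted linear combination of a text distance (cosine on raw term counts) and a citation distance (Jaccard on shared references), followed by average-linkage agglomerative clustering. All function names, the `weight` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def jaccard_distance(a, b):
    """1 - Jaccard overlap of two sets of cited references."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def hybrid_distances(texts, refs, weight=0.5):
    """Assumed scheme: weighted linear mix of text and citation distances."""
    tfs = [Counter(t.lower().split()) for t in texts]
    n = len(texts)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d_text = cosine_distance(tfs[i], tfs[j])
            d_cite = jaccard_distance(refs[i], refs[j])
            dist[i][j] = dist[j][i] = weight * d_text + (1 - weight) * d_cite
    return dist


def agglomerate(dist, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy corpus (hypothetical): short texts plus the set of references each one cites.
texts = [
    "citation analysis of journals",
    "journal citation impact analysis",
    "gene sequence alignment",
    "protein sequence alignment tools",
]
refs = [{"r1", "r2"}, {"r2", "r3"}, {"r4", "r5"}, {"r5", "r6"}]

clusters = agglomerate(hybrid_distances(texts, refs, weight=0.5), k=2)
print(clusters)
```

On this toy data the two bibliometrics documents and the two sequence-analysis documents end up in separate clusters. Setting `weight` to 1.0 or 0.0 reduces the sketch to a text-only or citation-only clustering, which is one simple way to compare the hybrid view against its two components.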