Information Retrieval and Terminology Extraction in Online Resources for Patients with Diabetes
Terminology use, as a means of information retrieval or document indexing, plays an important role in health literacy. Corpus-based terminology extraction can support information retrieval by providing "browsing phrases" that help users identify appropriate documents. The creation of specific terminology lists represents an intermediate step between free-text search and controlled vocabularies, which are used as indexes for document access. Various online resources (in foreign and/or native languages) have been created for patients with diabetes who need information on self-education in basic diabetes knowledge, on self-care activities concerning the importance of dietetic food, medications and physical exercise, and on self-management of insulin pumps. The research is divided into three interrelated parts that aim to detect the role of terminology in online resources: i) comparison of professional and popular terminology use in English and Croatian manuals and in Croatian online texts; ii) semi-automatic, statistically based extraction of terminology from English and Croatian manuals and online Croatian texts, and evaluation of the extracted terminology against three types of reference sets using the measures of recall, precision and F-measure; iii) comparison and evaluation of terminology extracted from the English manual using statistical and hybrid approaches, against three types of reference sets, using the measures of recall, precision and F-measure. Extracted terminology candidates are compared with three reference lists: a list created by a medical professional, a list of highly professional vocabulary, and a list created by academic non-medical persons, formed as the intersection of 15 lists. Online texts and manuals contain popular and professional terminology in different proportions: online texts 1:71, English manual 1:4.5, Croatian manual 1:7, all in favour of professional terminology.
When the results of terminology extraction based on the statistical approach are compared, higher scores are obtained for recall, especially against the lists created by the specialist doctor involved in diabetes education and by non-medical persons. Against the reference list created by the diabetologist, recall is almost perfect for Croatian web pages, while the reference list suggested by non-medical persons corresponds more closely to the terminology used in the manuals, especially the Croatian version. The highly specialized vocabulary contained in MeSH does not appear in the manuals. When the two approaches are compared, the hybrid approach, based on statistical and linguistic methods, obtains scores about 30% higher. The use of automatic and semi-automatic methods in terminology extraction could contribute to better information retrieval as one aspect of health literacy.
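
As an illustration of the evaluation measures named above, the following minimal Python sketch computes precision, recall and F-measure for a list of extracted term candidates against a reference list, and builds a reference list as the intersection of several annotators' lists. It is a sketch under stated assumptions, not the procedure used in the study: the function names, the case-insensitive exact matching, and the balanced F-measure (equal weight on precision and recall) are all assumptions, and the sample terms are invented for illustration.

```python
from typing import Iterable, List, Set, Tuple


def normalize(terms: Iterable[str]) -> Set[str]:
    """Lowercase and strip terms so matching is case-insensitive (an assumption)."""
    return {t.strip().lower() for t in terms if t.strip()}


def intersect_lists(lists: List[Iterable[str]]) -> Set[str]:
    """Keep only the terms that occur in every list (hypothetical helper)."""
    if not lists:
        return set()
    sets = [normalize(lst) for lst in lists]
    return set.intersection(*sets)


def evaluate(candidates: Iterable[str],
             reference: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) using exact term matching."""
    cand = normalize(candidates)
    ref = normalize(reference)
    matched = len(cand & ref)  # candidates that also occur in the reference list
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example data, for illustration only.
extracted = ["insulin pump", "blood glucose", "physical exercise", "web page"]
reference_list = ["Insulin pump", "blood glucose", "hypoglycemia"]

p, r, f = evaluate(extracted, reference_list)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Exact string matching is the simplest choice and will undercount matches in an inflected language such as Croatian; lemmatizing both lists before comparison would be one way to address this.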