Knjižnica Filozofskog fakulteta
Sveučilišta u Zagrebu
Faculty of Humanities and Social Sciences Institutional Repository

Multilevel presentation model of old Croatian dictionary texts



Bago, Petra. (2014). Multilevel presentation model of old Croatian dictionary texts. PhD Thesis. Filozofski fakultet u Zagrebu, Department of Information Science.
(Postgraduate doctoral study programme in information and communication sciences) [mentors: Boras, Damir and Ljubešić, Nikola].

PDF (Croatian, 4MB)


The aim of this research is to develop a multilevel presentation model of old Croatian dictionary texts that is interoperable with other language resources, tools and systems for natural language processing. The model is developed on seven selected dictionaries printed between 1595 and 1901. Interoperability is achieved by encoding the resource in a de facto standard, the Text Encoding Initiative (TEI). Finally, we apply automatic and semiautomatic natural language processing methods to the digitized historical texts, thereby speeding up and simplifying the processing of old dictionary texts. We use conditional random fields (CRF), a state-of-the-art supervised machine learning algorithm for sequence annotation. This phase of the research is conducted on the one dictionary whose entries have the most complex structure. The dataset contains 7,972 dictionary entries (403,128 tokens). The training set consists of 101 randomly selected dictionary entries (8,340 tokens). We labeled each token on two levels: a language annotation and a structural annotation. The language annotation has three labels, while the structural annotation has 19 labels. We reach an accuracy of 98.413% for language annotation and 96.371% for structural annotation. An additional experiment confirmed that correcting automatically generated labels is roughly 4.46 times faster than full manual annotation.
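To make the token-level setup in the abstract more concrete, the following is a minimal sketch of how a dictionary entry might be turned into per-token feature dictionaries of the kind commonly fed to CRF toolkits, together with the token-level accuracy measure. The feature set, the function names (`token_features`, `entry_features`, `token_accuracy`) and the toy entry in the usage note are illustrative assumptions; the abstract does not state which features or label names the thesis uses.

```python
from typing import Dict, List, Sequence


def token_features(tokens: Sequence[str], i: int) -> Dict[str, object]:
    """Describe token i with surface features and one token of context.

    The features chosen here are hypothetical, not those of the thesis.
    """
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        # A token with no letters or digits is treated as punctuation.
        "is_punct": not any(ch.isalnum() for ch in tok),
        "suffix3": tok[-3:].lower(),
        "position": i,
        # Sentinel values mark the entry boundaries.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def entry_features(tokens: Sequence[str]) -> List[Dict[str, object]]:
    """Feature sequence for one dictionary entry (one dict per token)."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def token_accuracy(gold: Sequence[Sequence[str]],
                   predicted: Sequence[Sequence[str]]) -> float:
    """Share of tokens whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same entries")
    total = 0
    correct = 0
    for gold_seq, pred_seq in zip(gold, predicted):
        if len(gold_seq) != len(pred_seq):
            raise ValueError("label sequences differ in length")
        total += len(gold_seq)
        correct += sum(g == p for g, p in zip(gold_seq, pred_seq))
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total
```

For an invented toy entry such as `["Abeceda", ",", "alphabetum", "."]`, `entry_features` returns four feature dictionaries, one per token. In a two-level scheme like the one described above, each token would then carry two gold labels (one language label, one structural label), and `token_accuracy` would be computed separately for each level.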

Item Type: PhD Thesis
Uncontrolled Keywords: historical dictionaries, language annotation, structural annotation, supervised machine learning, Text Encoding Initiative, conditional random fields
Subjects: Information sciences > Social-humanistic informatics
Departments: Department of Information Science
Supervisor: Boras, Damir and Ljubešić, Nikola
Additional Information: Postgraduate doctoral study programme in information and communication sciences
Date Deposited: 25 Feb 2015 10:39
Last Modified: 04 Apr 2017 10:21
