Knjižnica Filozofskog fakulteta
Sveučilišta u Zagrebu
Faculty of Humanities and Social Sciences Institutional Repository

Evaluating Full Lemmatization of Croatian Texts


Agić, Željko and Tadić, Marko and Dovedan, Zdravko. (2009). Evaluating Full Lemmatization of Croatian Texts. In: Recent Advances in Intelligent Information Systems. Challenging Problems of Science: Computer Science. Academic Publishing House EXIT, Warsaw, pp. 175-184. ISBN 978-83-60434-59-8


The paper presents the implementation and evaluation of a module for full lemmatization of Croatian texts. The module implements several lemmatization procedures, all of them based on merging outputs of the previously developed stochastic morphosyntactic tagger CroTag and the inflectional lexicon of Croatian. Evaluation of the lemmatization module on two test cases, simulating realistic and ideal operating conditions, provided full lemmatization accuracy scores of 96.96 and 98.15 percent on a newspaper corpus, respectively. It is also shown that a majority of errors in this framework, 57.14 percent in the realistic testing scenario, occur on word forms with external homography. Moreover, approximately 80 percent of all lemmatization errors occur on nouns, adjectives, verbs and adverbs in that particular order. Language resources, testing environment and procedure descriptions are provided in the paper along with a discussion of obtained results and possible future research directions.
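The general idea of merging a tagger's output with an inflectional lexicon can be illustrated with a minimal Python sketch. It is not the procedure from the paper: the toy lexicon, the tag strings, the function name `lemmatize`, and the fallback rules are all assumptions made for illustration. The word form "kose" is used as a hypothetical case of external homography, where one form belongs to more than one lemma and the tag assigned in context is what selects between them.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (an assumption, not the paper's resource):
# word form -> list of (lemma, morphosyntactic tag) readings.
Lexicon = Dict[str, List[Tuple[str, str]]]

TOY_LEXICON: Lexicon = {
    # One form, two lemmas: an illustrative case of external homography.
    "kose": [("kosa", "Ncfsg"), ("kositi", "Vmr3p")],
    "tekstova": [("tekst", "Ncmpg")],
}


def lemmatize(form: str, tag: str, lexicon: Lexicon) -> str:
    """Pick a lemma for `form` given the tag a tagger assigned to it.

    Assumed fallback rules for this sketch:
    1. form not in the lexicon: return the form unchanged;
    2. no reading matches the tag: return the first listed lemma.
    """
    readings = lexicon.get(form.lower(), [])
    if not readings:
        return form  # unknown word form, nothing to merge with
    for lemma, msd in readings:
        if msd == tag:
            return lemma  # tagger output and lexicon reading agree
    return readings[0][0]  # tag not licensed by the lexicon


# Illustrative tagger output: (word form, tag) pairs.
tagged = [("kose", "Vmr3p"), ("kose", "Ncfsg"), ("tekstova", "Ncmpg")]
lemmas = [lemmatize(form, tag, TOY_LEXICON) for form, tag in tagged]
```

In this sketch a tagging error on a homographic form leads directly to a wrong lemma, which is one plausible reading of why such forms could account for a large share of lemmatization errors.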

Item Type: Book Section
Uncontrolled Keywords: full lemmatization, morphosyntactic tagging, Croatian language
Subjects: Information sciences > Social-humanistic informatics
Departments: Department of Information Science
Department of Linguistics
Date Deposited: 29 Oct 2012 15:04
Last Modified: 29 Oct 2012 15:05
