Agić, Željko and Tadić, Marko and Dovedan, Zdravko. (2009). Tagset Reductions in Morphosyntactic Tagging of Croatian Texts. In: 2nd International Conference “The Future of Information Sciences: INFuture2009 – Digital Resources and Knowledge Sharing”, 4-6 November 2009, Zagreb, Croatia.
Abstract
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model in terms of stochastic tagging of free domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. Obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
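The idea of a tagset reduction summarized in the abstract can be illustrated with a minimal Python sketch. It assumes one simple kind of reduction, truncating a positional morphosyntactic tag to its first few positions, which may differ from the subsets actually defined in the paper. The function names (`reduce_tag`, `accuracy`, `f1_per_tag`) and the toy MSD-style tags are hypothetical and are not taken from the paper, its corpus, or the CroTag tagger.

```python
from collections import Counter


def reduce_tag(msd: str, keep: int) -> str:
    """Keep only the first `keep` positions of a positional MSD-style tag.

    Assumption: the reduction is plain truncation; keep=1 leaves only
    the part-of-speech letter.
    """
    return msd[:keep]


def accuracy(gold, pred):
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_per_tag(gold, pred):
    """Per-tag F1, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag was wrongly assigned
            fn[g] += 1  # gold tag was missed
    scores = {}
    for tag in set(gold) | set(pred):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores[tag] = 2 * tp[tag] / denom if denom else 0.0
    return scores


# Toy, illustrative tags only (not data from the paper).
gold = ["Ncmsn", "Vmip3s", "Ncfsa", "Afpmsn"]
pred = ["Ncmsa", "Vmip3s", "Ncfsa", "Afpmsg"]

full_acc = accuracy(gold, pred)

# Reduce both sequences to the part-of-speech letter and score again.
pos_gold = [reduce_tag(t, 1) for t in gold]
pos_pred = [reduce_tag(t, 1) for t in pred]
pos_acc = accuracy(pos_gold, pos_pred)
pos_f1 = f1_per_tag(pos_gold, pos_pred)
```

In this toy case both errors involve only the final tag position, so they disappear once the tags are truncated. This sketch only rescores existing predictions under a coarser tagset; retraining a tagger on the reduced tagset, as a stochastic tagger would require, is a separate step not shown here.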
Item Type: | Published conference work (Lecture) |
---|---|
Uncontrolled Keywords: | morphosyntactic tagging, part-of-speech tagging, stochastic tagger, Multext East tagset, tagset reductions, Croatian language |
Subjects: | Information sciences > Social-humanistic informatics; Information sciences > Natural language processing, lexicography and encyclopedic science; Linguistics |
Departments: | Department of Linguistics; Department of Information Science |
Date Deposited: | 24 Feb 2017 09:46 |
Last Modified: | 24 Feb 2017 09:46 |
URI: | http://darhiv.ffzg.unizg.hr/id/eprint/8035 |