Agić, Željko.
(2012).
Approaches to dependency parsing of Croatian texts.
PhD Thesis. Filozofski fakultet u Zagrebu, Department of Information Science.
[mentor Dovedan Han, Zdravko and Tadić, Marko].
Abstract
Within the formal framework of language technologies, and of the scientific disciplines that comprise it, natural language text parsing is defined as the automatic syntactic analysis of sentences, or as an algorithmic procedure for the unambiguous detection of the syntactic roles of words in the construction of basic grammatical structures (such as sentence predicates, subjects and objects) with respect to a previously defined syntactic formalism of the specific natural language. The usefulness of natural language text parsing is reflected today in many other natural language processing tasks (such as question answering, semantic role detection and statistical machine translation), as well as in information retrieval and extraction, data mining and language research in general. This thesis investigated several approaches to data-driven dependency parsing of Croatian texts, i.e. approaches to the automatic syntactic analysis of sentences written in Croatian in accordance with a predefined word-dependency-based computational model of Croatian syntax contained implicitly within a corpus of syntactically annotated Croatian texts. Parsing was first defined as a problem in the general domains of natural language processing and computational intelligence. The problem of data-driven dependency parsing of natural language text was then introduced by using the framework of formal language theory and by defining general formal requirements and evaluation criteria for natural language parsing. Two state-of-the-art general approaches to data-driven dependency parsing were described in detail: graph-based dependency parsing and transition-based dependency parsing. A novel approach was designed and implemented specifically for dependency parsing of Croatian text, using the Croatian Dependency Treebank and CROVALLEX, a valency lexicon of Croatian verbs.
The approach was based on linking a graph-based data-driven dependency parser with the valency lexicon by re-ranking the k-best dependency trees suggested by the data-driven module on the basis of valency information encoded within the lexicon. An experiment was carried out using the Croatian Dependency Treebank and a defined set of metrics for evaluating parsing accuracy and efficiency. The suggested hybrid parsing system achieved the highest labeled attachment score (LAS) within the experiment, accurately parsing approximately 77.21% of the wordforms from the treebank. These scores were further shown to be significantly different, i.e. at least 2.68% higher than the highest scores of any of the data-driven parsing systems.
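For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a labeled attachment score is conventionally computed: the percentage of wordforms that receive both the correct head and the correct dependency label. The unlabeled variant (UAS) is shown alongside for contrast. The function name `attachment_scores`, the `Arc` representation and the toy sentence are assumptions made for illustration; they are not taken from the thesis, and the thesis may define its evaluation set (for example, the treatment of punctuation) differently.

```python
from typing import List, Tuple

# One token's analysis: (index of its head word, dependency label).
# Head index 0 conventionally denotes the artificial root node.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens.

    UAS counts tokens whose predicted head matches the gold head.
    LAS additionally requires the dependency label to match.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token sequence")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


# Toy four-token example; heads and labels are invented for illustration.
gold = [(2, "Sb"), (0, "Pred"), (4, "Atr"), (2, "Obj")]
pred = [(2, "Sb"), (0, "Pred"), (2, "Atr"), (2, "Adv")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")  # UAS = 75.00%, LAS = 50.00%
```

In the toy example the third token has the wrong head and the fourth has the right head but the wrong label, so three of four heads and two of four labeled arcs are correct.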
Item Type: PhD Thesis
Uncontrolled Keywords: dependency parsing, data-driven parsing, dependency syntax, Croatian language, Croatian Dependency Treebank, hybrid approach, language technologies
Subjects: Information sciences > Social-humanistic informatics
Departments: Department of Information Science
Supervisor: Dovedan Han, Zdravko and Tadić, Marko
Date Deposited: 04 Oct 2013 08:48
Last Modified: 09 Jul 2014 13:09
URI: http://darhiv.ffzg.unizg.hr/id/eprint/2337