Knjižnica Filozofskog fakulteta
Sveučilišta u Zagrebu
Faculty of Humanities and Social Sciences Institutional Repository

Approaches to dependency parsing of Croatian texts

Agić, Željko. (2012). Approaches to dependency parsing of Croatian texts. PhD Thesis. Filozofski fakultet u Zagrebu, Department of Information Science. [mentors: Dovedan Han, Zdravko and Tadić, Marko].

Full text: PDF (Croatian), 4MB

Abstract

In the formal framework of language technologies, and in the formal frameworks of the scientific disciplines that comprise it, natural language text parsing is defined as the automatic syntactic analysis of sentences, or as an algorithmic procedure for unambiguously detecting the syntactic roles that words play in basic grammatical structures (such as sentence predicates, subjects and objects) with respect to a previously defined syntactic formalism for the language in question. Parsing is useful today in many other natural language processing tasks, such as question answering, semantic role detection and statistical machine translation, as well as in information retrieval and extraction, data mining and language research in general.

This thesis investigated several approaches to data-driven dependency parsing of Croatian texts, i.e. approaches to automatic syntactic analysis of Croatian sentences in accordance with a predefined word-dependency-based computational model of Croatian syntax that is contained implicitly in a corpus of syntactically annotated Croatian texts. Parsing was first defined as a problem in the general domains of natural language processing and computational intelligence. The problem of data-driven dependency parsing was then introduced by using the framework of formal language theory and by defining general formal requirements and evaluation criteria for natural language parsing. Two state-of-the-art general approaches to data-driven dependency parsing were described in detail: graph-based dependency parsing and transition-based dependency parsing.

A novel approach was designed and implemented specifically for dependency parsing of Croatian text, using the Croatian Dependency Treebank and CROVALLEX, a valency lexicon of Croatian verbs. The approach links a graph-based data-driven dependency parser with the valency lexicon by re-ranking the k-best dependency trees suggested by the data-driven module on the basis of valency information encoded in the lexicon. An experiment was carried out on the Croatian Dependency Treebank, with a set of metrics defined for evaluating parsing accuracy and efficiency. The suggested hybrid parsing system achieved the highest labeled attachment score (LAS) in the experiment, accurately parsing approximately 77.21% of the wordforms in the treebank. This score was further shown to be significantly different from, i.e. at least 2.68% higher than, the highest scores for any of the data-driven parsing systems.
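For readers unfamiliar with the metric, the labeled attachment score reported above can be illustrated with a minimal sketch. It uses the standard definition of LAS (the share of tokens that receive both the correct head and the correct dependency label) and is not taken from the thesis. The function name, the data layout and the toy labels are assumptions made for illustration only.

```python
from typing import List, Tuple

# One token's analysis: (head index, dependency label).
# Head index 0 stands for the artificial root. This layout is an assumption.
Arc = Tuple[int, str]


def labeled_attachment_score(gold: List[List[Arc]],
                             predicted: List[List[Arc]]) -> float:
    """Return the share of tokens whose head AND label both match the gold tree."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have the same length")
        for gold_arc, pred_arc in zip(gold_sent, pred_sent):
            total += 1
            if gold_arc == pred_arc:  # both head and label must agree
                correct += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


# Toy, made-up sentence with three tokens; the labels are hypothetical.
gold = [[(2, "Sb"), (0, "Pred"), (2, "Obj")]]
pred = [[(2, "Sb"), (0, "Pred"), (2, "Adv")]]  # last token: right head, wrong label
las = labeled_attachment_score(gold, pred)
```

In this toy case two of the three tokens are fully correct, so the score is 2/3. A token with the right head but the wrong label counts as an error under LAS, which is what distinguishes it from the unlabeled attachment score.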

Item Type: PhD Thesis
Uncontrolled Keywords: dependency parsing, data-driven parsing, dependency syntax, Croatian language, Croatian Dependency Treebank, hybrid approach, language technologies
Subjects: Information sciences > Social-humanistic informatics
Departments: Department of Information Science
Supervisor: Dovedan Han, Zdravko and Tadić, Marko
Date Deposited: 04 Oct 2013 08:48
Last Modified: 09 Jul 2014 13:09
URI: http://darhiv.ffzg.unizg.hr/id/eprint/2337
