Approaches to dependency parsing of Croatian texts

Agić, Željko. (2012). Approaches to dependency parsing of Croatian texts. PhD Thesis. Filozofski fakultet u Zagrebu, Department of Information Science. [mentor Dovedan Han, Zdravko and Tadić, Marko].

PDF (Croatian)

Abstract

Within the formal framework of language technologies, and of the scientific disciplines that comprise it, natural language text parsing is defined as the automatic syntactic analysis of sentences, or as an algorithmic procedure for unambiguously detecting the syntactic roles of words in the construction of basic grammatical structures (such as sentence predicates, subjects and objects) with respect to a previously defined syntactic formalism of the specific natural language. The usefulness of natural language text parsing is reflected today in many other natural language processing tasks, such as question answering, semantic role detection and statistical machine translation, as well as in information retrieval and extraction, data mining and language research in general.

This thesis investigated several approaches to data-driven dependency parsing of Croatian texts, i.e. approaches to the automatic syntactic analysis of sentences written in Croatian in accordance with a predefined word-dependency-based computational model of Croatian syntax that is contained implicitly in a corpus of syntactically annotated Croatian texts. Parsing was first defined as a problem in the general domains of natural language processing and computational intelligence. The problem of data-driven dependency parsing of natural language text was then introduced by using the framework of formal language theory and by defining general formal requirements and evaluation criteria for natural language parsing. Two state-of-the-art general approaches to data-driven dependency parsing were described in detail: graph-based dependency parsing and transition-based dependency parsing.

A novel approach was designed and implemented specifically for dependency parsing of Croatian text, using the Croatian Dependency Treebank and CROVALLEX, a valency lexicon of Croatian verbs. The approach links a graph-based data-driven dependency parser with the valency lexicon by re-ranking the k-best dependency trees suggested by the data-driven module on the basis of valency information encoded in the lexicon. An experiment was carried out using the Croatian Dependency Treebank and a defined set of metrics for evaluating parsing accuracy and efficiency. The suggested hybrid parsing system achieved the highest labeled attachment score (LAS) in the experiment, correctly parsing approximately 77.21% of the wordforms in the treebank. This score was further shown to differ significantly from the highest scores of the data-driven parsing systems, being at least 2.68% higher than any of them.
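The re-ranking idea and the LAS metric mentioned in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the scoring rule (parser score plus a weighted count of verbs whose dependents cover a lexicon frame), the `weight` parameter, the toy lexicon format, the dependency labels and the example sentence are all assumptions made for illustration, and CROVALLEX itself is not structured this way.

```python
from dataclasses import dataclass

# One (head index, dependency label) pair per token; head 0 is the artificial root.
Arc = tuple[int, str]


@dataclass
class Candidate:
    arcs: list[Arc]       # arcs[i] describes token i + 1
    parser_score: float   # score assigned by the data-driven parser (hypothetical scale)


def valency_bonus(candidate, lemmas, lexicon):
    """Count verbs whose dependent labels cover at least one frame in the lexicon."""
    bonus = 0
    for verb_idx, lemma in enumerate(lemmas, start=1):
        frames = lexicon.get(lemma)
        if not frames:
            continue  # token is not a verb known to the (toy) lexicon
        dependents = {label for head, label in candidate.arcs if head == verb_idx}
        if any(frame <= dependents for frame in frames):
            bonus += 1
    return bonus


def rerank(candidates, lemmas, lexicon, weight=0.5):
    """Pick the k-best candidate with the highest combined score (assumed rule)."""
    return max(
        candidates,
        key=lambda c: c.parser_score + weight * valency_bonus(c, lemmas, lexicon),
    )


def las(predicted, gold):
    """Labeled attachment score: share of tokens with correct head AND label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy sentence "Ivan čita knjigu" with illustrative labels.
lemmas = ["Ivan", "čitati", "knjiga"]
lexicon = {"čitati": [frozenset({"Sb", "Obj"})]}  # hypothetical frame encoding
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj")]

candidates = [
    Candidate(arcs=[(2, "Sb"), (0, "Pred"), (2, "Adv")], parser_score=0.9),
    Candidate(arcs=[(2, "Sb"), (0, "Pred"), (2, "Obj")], parser_score=0.8),
]

best = rerank(candidates, lemmas, lexicon)
print(las(candidates[0].arcs, gold), las(best.arcs, gold))
```

In this toy case the parser's top-scored tree mislabels the object, while the second-best tree satisfies the verb's frame and is promoted by the re-ranker, which raises LAS on the sentence from 2/3 to 1.0.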

Item Type:

PhD Thesis

Related URLs:

URL	URL Type
http://bib.irb.hr/prikazi-rad?&rad=587199	UNSPECIFIED

Uncontrolled Keywords:

dependency parsing, data-driven parsing, dependency syntax, Croatian language, Croatian Dependency Treebank, hybrid approach, language technologies

Subjects:

Information sciences > Social-humanistic informatics

Departments:

Department of Information Science

Supervisor:

Dovedan Han, Zdravko and Tadić, Marko

Date Deposited:

04 Oct 2013 08:48

Last Modified:

09 Jul 2014 13:09

URI:

http://darhiv.ffzg.unizg.hr/id/eprint/2337
