Knjižnica Filozofskog fakulteta
Sveučilišta u Zagrebu
Faculty of Humanities and Social Sciences Institutional Repository

Language Identification of Web Data for Building Linguistic Corpora


Stupar, Marija and Jurić, Tereza and Ljubešić, Nikola. (2011). Language Identification of Web Data for Building Linguistic Corpora. In: 3rd International Conference "The Future of Information Sciences: INFuture2011 – Information Sciences and e-Society", 9-11 November 2011, Zagreb.

In this paper we examine a series of methods for language identification on web data. We start from two standard methods, one based on function word frequencies and one based on Markov chains, and investigate the problem at both the document and the paragraph level. After gaining insight into the strengths and weaknesses of these basic methods, we propose two hybrid methods, the more complex of which performs at least as well as the best basic method. Identifying the language of each paragraph in more than three million documents collected for the Croatian Web Corpus hrWaC shows that around 96% of the documents are monolingual and that the language distribution, as expected, follows a power law.
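
To make the two basic approaches named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy two-language setup: the function word lists, the training sentences, the use of character bigrams, and add-one smoothing are all illustrative assumptions that do not come from the paper.

```python
import math
from collections import Counter

# Hypothetical toy resources (assumptions for illustration only);
# a real system would derive these from large per-language corpora.
FUNCTION_WORDS = {
    "hr": {"i", "je", "u", "na", "se", "da", "su", "za"},
    "en": {"the", "of", "and", "to", "in", "is", "that", "for"},
}
TRAINING_TEXT = {
    "hr": "ovo je mali primjer teksta na hrvatskom jeziku koji se koristi za učenje modela",
    "en": "this is a small example of text in the english language that is used for training",
}


def function_word_scores(text):
    """Count how many tokens of the text are function words of each language."""
    tokens = text.lower().split()
    return {
        lang: sum(token in words for token in tokens)
        for lang, words in FUNCTION_WORDS.items()
    }


def train_bigram_model(text):
    """Collect character bigram and context counts (a first-order Markov chain)."""
    padded = " " + text.lower() + " "
    pair_counts = Counter(zip(padded, padded[1:]))
    context_counts = Counter(padded[:-1])
    return pair_counts, context_counts, set(padded)


def bigram_log_prob(text, model, vocab_size):
    """Log-probability of the text under a bigram model with add-one smoothing."""
    pair_counts, context_counts, _ = model
    padded = " " + text.lower() + " "
    return sum(
        math.log((pair_counts[(a, b)] + 1) / (context_counts[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def identify(text):
    """Return the (function word, Markov chain) predictions for one text unit."""
    fw_scores = function_word_scores(text)
    models = {lang: train_bigram_model(t) for lang, t in TRAINING_TEXT.items()}
    # Shared character vocabulary so smoothing is comparable across languages.
    vocab = set(text.lower()) | {" "}
    for _, _, chars in models.values():
        vocab |= chars
    mc_scores = {
        lang: bigram_log_prob(text, model, len(vocab))
        for lang, model in models.items()
    }
    return max(fw_scores, key=fw_scores.get), max(mc_scores, key=mc_scores.get)
```

Calling `identify` once per paragraph rather than once per document would correspond to the paragraph-level setting mentioned in the abstract. How the two predictions should be combined into a hybrid decision is not specified here, so the sketch simply returns both.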

Item Type: Published conference work (Lecture)
Uncontrolled Keywords: language identification, Web data, Croatian Web corpus, Markov model, function words
Subjects: Information sciences > Natural language processing, lexicography and encyclopedic science
Departments: Department of Information Science
Date Deposited: 01 Mar 2017 10:03
Last Modified: 01 Mar 2017 10:03