Language Identification of Web Data for Building Linguistic Corpora

Stupar, Marija and Jurić, Tereza and Ljubešić, Nikola. (2011). Language Identification of Web Data for Building Linguistic Corpora. In: 3rd International Conference "The Future of Information Sciences: INFuture2011 – Information Sciences and e-Society", 9-11 November 2011, Zagreb.

Official URL: http://infoz.ffzg.hr/INFuture/2011/papers/INFuture2011.pdf

Abstract

In this paper we inspect a series of methods for language identification on web data. We start from the two standard methods, based on function word frequencies and on Markov chains, and investigate the problem at both the document and the paragraph level. After gaining insight into the strengths and weaknesses of these basic methods, we propose two hybrid methods, the more complex of which performs at least as well as the best basic method. Identifying the language of each paragraph in more than three million documents collected for the Croatian Web Corpus hrWaC shows that around 96% of the documents are monolingual and that the distribution of languages, as expected, follows a power law.
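
To make the two basic approaches named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a character-bigram (first-order) Markov chain with add-one smoothing and a simple count of function-word hits. The training sentences, the function-word lists, and all function names are hypothetical and chosen only for illustration; the paper's hybrid methods are not reproduced here because the abstract does not describe them.

```python
import math
from collections import Counter

# Toy training text per language (assumption: NOT the paper's training data).
TRAIN = {
    "en": "the cat is in the house and the dog is on the street with a ball",
    "hr": "mačka je u kući i pas je na ulici s loptom a to je sve što se vidi",
}

# Hypothetical, hand-picked function-word lists for illustration only.
FUNCTION_WORDS = {
    "en": {"the", "is", "in", "and", "on", "with", "a", "of", "to"},
    "hr": {"je", "u", "i", "na", "s", "a", "se", "što", "da"},
}


def function_word_scores(text):
    """Count how many whitespace tokens are function words of each language."""
    tokens = text.lower().split()
    return {lang: sum(tok in words for tok in tokens)
            for lang, words in FUNCTION_WORDS.items()}


def train_bigram_model(text):
    """Collect character-bigram and context counts for a first-order chain."""
    padded = " " + text.lower() + " "
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])
    return bigrams, contexts


def markov_log_prob(text, model, alphabet_size):
    """Log-score of `text` under the chain, using add-one smoothing."""
    bigrams, contexts = model
    padded = " " + text.lower() + " "
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + alphabet_size))
        for a, b in zip(padded, padded[1:])
    )


MODELS = {lang: train_bigram_model(text) for lang, text in TRAIN.items()}
# Smoothing denominator: number of distinct characters seen in training.
ALPHABET_SIZE = len(set(" " + "".join(TRAIN.values()).lower()))


def identify_markov(text):
    """Return the language whose chain assigns the highest log-score."""
    scores = {lang: markov_log_prob(text, model, ALPHABET_SIZE)
              for lang, model in MODELS.items()}
    return max(scores, key=scores.get)


def paragraph_languages(document):
    """Identify the language of each blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [identify_markov(p) for p in paragraphs]
```

Under these assumptions, a document would count as monolingual when `paragraph_languages` returns a single distinct label. Realistic use would require far larger training data than the toy sentences above, and short paragraphs are likely to be the hardest case for either method.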

Item Type:	Published conference work (Lecture)
Uncontrolled Keywords:	language identification, Web data, Croatian Web corpus, Markov model, function words
Subjects:	Information sciences > Natural language processing, lexicography and encyclopedic science; Linguistics
Departments:	Department of Information Science
Date Deposited:	01 Mar 2017 10:03
Last Modified:	01 Mar 2017 10:03
URI:	http://darhiv.ffzg.unizg.hr/id/eprint/8200
