Stupar, Marija and Jurić, Tereza and Ljubešić, Nikola. (2011). Language Identification of Web Data for Building Linguistic Corpora. In: 3rd International Conference "The Future of Information Sciences: INFuture2011 – Information Sciences and e-Society", 9-11 November 2011, Zagreb.
Abstract
In this paper we inspect a series of methods for language identification on web data. We start from the two standard methods, based on function word frequencies and on Markov chains, and investigate the problem at both the document and the paragraph level. After gaining insight into the strengths and weaknesses of these basic methods, we propose two hybrid methods, of which the more complex one performs at least as well as the best basic method. Identifying the language of each paragraph in more than three million documents collected for the Croatian Web Corpus hrWaC shows that around 96% of the documents are monolingual and that the language distribution, as expected, follows a power law.
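The two baseline methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the character-bigram order, the add-one smoothing, the `alphabet_size` value, the toy training sentences, and the function word lists are all assumptions chosen only to keep the example small.

```python
import math
from collections import Counter, defaultdict


def train_markov(text):
    """Count character-bigram transitions in a training text."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts


def log_prob(text, counts, alphabet_size=256):
    """Log-probability of text under a bigram model with add-one smoothing.

    alphabet_size is an assumed constant, not a value from the paper.
    """
    total = 0.0
    for a, b in zip(text, text[1:]):
        row = counts.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + alphabet_size))
    return total


def identify_markov(text, models):
    """Pick the language whose Markov model gives the text the highest score."""
    return max(models, key=lambda lang: log_prob(text, models[lang]))


def identify_function_words(text, function_words):
    """Pick the language whose function word list matches the most tokens."""
    tokens = text.lower().split()
    return max(function_words,
               key=lambda lang: sum(t in function_words[lang] for t in tokens))


# Toy training data and word lists (illustrative assumptions only).
models = {
    "hr": train_markov("ovo je kratak tekst na hrvatskom jeziku i služi samo kao primjer"),
    "en": train_markov("this is a short text in the english language and serves only as an example"),
}
function_words = {
    "hr": {"je", "i", "na", "u", "se"},
    "en": {"the", "is", "a", "and", "in", "of"},
}

print(identify_markov("this is an example", models))
print(identify_function_words("ovo je primjer na hrvatskom", function_words))
```

Applied per paragraph rather than per document, the same functions would give the kind of paragraph-level labels the abstract refers to. Realistic models would need far larger training texts than these single sentences.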
Item Type: | Published conference work (Lecture) |
---|---|
Uncontrolled Keywords: | language identification, Web data, Croatian Web corpus, Markov model, function words |
Subjects: | Information sciences > Natural language processing, lexicography and encyclopedic science; Linguistics |
Departments: | Department of Information Science |
Date Deposited: | 01 Mar 2017 10:03 |
Last Modified: | 01 Mar 2017 10:03 |
URI: | http://darhiv.ffzg.unizg.hr/id/eprint/8200 |