Knjižnica Filozofskog fakulteta
Sveučilišta u Zagrebu
Faculty of Humanities and Social Sciences Institutional Repository

N-gram Overlap in Automatic Detection of Document Derivation

Bosanac, Siniša and Štefanec, Vanja. (2011). N-gram Overlap in Automatic Detection of Document Derivation. In: 3rd International Conference "The Future of Information Sciences: INFuture2011 – Information Sciences and e-Society", 9-11 November 2011, Zagreb.

Abstract

Establishing the authenticity and independence of documents in relation to others is not a new problem, but in the era of hyperproduction of e-text it has gained even more importance. There is an increased need for automatic methods for determining the originality of documents in a digital environment. N-gram overlap is one of several methods proposed in the literature and is used in a variety of systems for automatic identification of text reuse. Although the method itself is simple, determining the n-gram length that would be a good indicator of text reuse is a more complex issue. We assume that the optimal n-gram length is not the same for all languages but depends on the properties of the particular language, such as morphological typology and syntactic features. The aim of this study is to find the optimal n-gram length for determining document derivation in the Croatian language. Potential areas of application for the results of this study include automatic detection of plagiarism in academic and student papers, citation analysis, information flow tracking and event detection in online texts.
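
As an illustration of the general idea of n-gram overlap, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the tokenization or the overlap measure, so word-level n-grams and a containment score (the share of a candidate's n-grams that also occur in the source) are assumptions made here, as are the function names and the English sample sentences.

```python
import re


def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in text.

    Assumption: tokens are lowercased runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(candidate, source, n):
    """Share of the candidate's n-grams that also occur in the source.

    Assumption: containment is used as the overlap measure; other
    measures (e.g. Jaccard) would be equally plausible.
    """
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


# Hypothetical sample texts, chosen only to show the effect of n.
source = "the quick brown fox jumps over the lazy dog"
derived = "a quick brown fox jumps over a sleeping dog"

for n in (1, 2, 3, 4):
    print(n, round(ngram_containment(derived, source, n), 2))
```

In this toy pair the score drops as n grows, which hints at why the choice of n-gram length matters: short n-grams match on common vocabulary alone, while long n-grams match only on extended verbatim runs.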

Item Type: Published conference work (Lecture)
Uncontrolled Keywords: document derivation, text reuse, n-gram overlap, automatic plagiarism detection, string metrics
Subjects: Information sciences > Social-humanistic informatics
Information sciences > Natural language processing, lexicography and encyclopedic science
Linguistics
Departments: Department of Information Science
Date Deposited: 19 May 2017 09:25
Last Modified: 19 May 2017 09:25
URI: http://darhiv.ffzg.unizg.hr/id/eprint/8458
