Knjižnica Filozofskog fakulteta
Sveučilišta u Zagrebu
Faculty of Humanities and Social Sciences Institutional Repository

Event detection in parallel information sources

Ljubešić, Nikola. (2009). Event detection in parallel information sources. PhD Thesis. Filozofski fakultet u Zagrebu, Department of Information Science. [mentor Boras, Damir].

PDF (Croatian), 876kB

Abstract

This dissertation addresses the problem of event detection in parallel information sources. The data sample contains 2,486 documents collected from 17 Croatian news portals and published within a span of three days. The sample was tagged by human annotators using an application developed for this purpose; the annotators were given document candidates calculated in advance. The tagged sample is analyzed and two κ coefficients are calculated. Six typical clustering evaluation measures are used. The F0.5 measure proved best suited to this task because it favors precision over recall. Purity is not applicable to non-partitional clustering algorithms, while NMI and RI are unsuitable for this task because of the high number of true negatives. A list of variables is tested empirically. Three hierarchical clustering algorithms and one single-pass algorithm are compared; the single-pass algorithm proves as efficient as the more complex hierarchical ones. Six distance measures are compared, and the cosine measure is chosen as optimal because it gives better results at lower time complexity. Two heuristics concerning the time and place of publication of the documents prove useful both in vitro and in vivo. Of five feature weight measures, the classical TF-IDF is chosen. Five methods of feature selection and extraction on the token level and four methods on higher language levels are also evaluated. In general, the simpler token-level methods are more efficient for the given task than the more complex ones. A reference corpus of half a million tokens proves most efficient. By optimizing the whole event detection procedure, an F0.5 score of ~0.82 is achieved.
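
As a rough illustration of the components named in the abstract, the following Python sketch combines TF-IDF weighting, cosine similarity, a single-pass clustering loop, and the F-beta measure with beta = 0.5. It is not taken from the thesis: the function names, the logarithmic IDF variant, the summed-vector centroid, and the similarity threshold of 0.3 are all assumptions chosen for illustration, and the thesis may define each step differently.

```python
import math
from collections import Counter


def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall; beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dict.

    Uses raw term frequency and idf = log(N / df); this variant is an assumption.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def single_pass(vectors, threshold=0.3):
    """Assign each document to its most similar cluster, or open a new one.

    The threshold value is hypothetical, not a figure from the thesis.
    """
    clusters = []  # each cluster: {"centroid": dict, "members": [doc indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
        else:
            best["members"].append(i)
            # Summed vector serves as centroid; cosine ignores its scale.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
    return [c["members"] for c in clusters]


if __name__ == "__main__":
    # Toy documents, invented for this example.
    docs = [
        ["earthquake", "zagreb", "damage"],
        ["earthquake", "zagreb", "rescue"],
        ["football", "match", "dinamo"],
        ["football", "match", "hajduk"],
    ]
    print(single_pass(tfidf_vectors(docs)))
    print(f_beta(0.9, 0.6), f_beta(0.6, 0.9))
```

Because beta is below 1, swapping a high precision for an equally high recall lowers the score, which is the sense in which F0.5 favors precision over recall. Each document is compared only against the clusters that exist when it arrives, so the clustering needs one pass over the data.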

Item Type: PhD Thesis
Uncontrolled Keywords: event detection, clustering, distance measures, feature weight measures, document formalization
Subjects: Information sciences > Social-humanistic informatics; Linguistics
Departments: Department of Information Science
Supervisor: Boras, Damir
Date Deposited: 18 Oct 2012 15:41
Last Modified: 09 Jul 2014 14:08
URI: http://darhiv.ffzg.unizg.hr/id/eprint/1863
