Using search term positions for determining document relevance [Electronic resource] / submitted by Patricio Galeas

English
175 Pages


Published 01 January 2010
Language English
Document size 2 MB

Using Search Term Positions for Determining
Document Relevance
Dissertation for the award of the degree of
Doctor of Natural Sciences
(Dr. rer. nat.)
Submitted to the Department of Mathematics and Computer Science
of Philipps-Universität Marburg
by
Patricio Galeas
born in Temuco, Chile
Marburg
2010

Accepted as a dissertation by the Department of Mathematics and Computer Science of
Philipps-Universität Marburg on ____.
First reviewer: Prof. Dr. Bernd Freisleben
Second reviewer: Prof. Dr. Bernhard Seeger
Date of the oral examination: 25 June 2010

Declaration
I affirm that I wrote my dissertation
Using Search Term Positions for Determining Document Relevance
independently and without unauthorized help, and that I used no sources or aids other than
those I have expressly indicated.
The dissertation has not been submitted in its present or a similar form to any other
university and has not served any other examination purpose.
Marburg, ____

Abstract
The technological advancements in computer networks and the substantial reduction of their
production costs have caused a massive explosion of digitally stored information. In particular,
textual information is becoming increasingly available in electronic form.
Finding text documents dealing with a certain topic is not a simple task. Users need tools
to sift through non-relevant information and retrieve only pieces of information relevant to
their needs [14]. Traditional information retrieval (IR) methods based on search term
frequency have reached their limits, and newer ranking methods based on
hyperlink information are not applicable to unlinked documents.
The retrieval of documents based on the positions of search terms in a document has
the potential of yielding improvements, because other terms in the environment where a
search term appears (i.e. the neighborhood) are considered. That is to say, the grammatical
type, position and frequency of other words help to clarify and specify the meaning of a
given search term [98]. However, the additional analysis makes position-based
methods slower than methods based on term frequency, and storing the positions of
terms requires more space. These drawbacks directly affect the performance of the most
user-critical phase of the retrieval process, namely query evaluation, which explains
the scarce use of positional information in contemporary retrieval systems.
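
To make the storage trade-off concrete, the following minimal sketch contrasts a frequency-only index with a positional index over a toy collection. The function names, the whitespace tokenizer, and the two sample documents are illustrative assumptions and do not come from the thesis.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems do more.
    return text.lower().split()

def build_frequency_index(docs):
    """term -> {doc_id: count}; one integer per (term, document) pair."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

def build_positional_index(docs):
    """term -> {doc_id: [positions]}; one integer per term occurrence."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def stored_integers(index):
    # Count the integers kept in the postings (counts or positions).
    total = 0
    for postings in index.values():
        for value in postings.values():
            total += len(value) if isinstance(value, list) else 1
    return total

# Hypothetical toy collection.
docs = {
    "d1": "term positions help ranking because term positions carry context",
    "d2": "term frequency ignores where a term appears",
}
freq = build_frequency_index(docs)
posi = build_positional_index(docs)
print(stored_integers(freq), stored_integers(posi))
```

The positional index stores one integer per occurrence rather than one per term and document, so its size grows with document length. This is the extra cost that has to be paid at query evaluation time.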
This thesis explores the possibility of extending traditional information retrieval systems
with positional information in an efficient manner that permits us to optimize the retrieval
performance by handling term positions at query evaluation time.
To achieve this, several abstract representations of term positions that allow positional
data to be stored and processed efficiently are investigated. In the Gauss model, methods
from descriptive statistics are used to estimate term positional information, because they
minimize outliers and irregularities in the data. The Fourier model uses Fourier series to
represent positional information. In the Hilbert model, functional analysis methods are used
to provide reliable term position estimations and simple mathematical operators to handle
positional data.
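
As a rough illustration of what a compact representation of term positions can look like, the sketch below summarizes one term's positions in two ways: by mean and standard deviation (a descriptive-statistics summary) and by the first few Fourier coefficients of the term's occurrence pattern. It only illustrates the general idea under stated assumptions; the actual Gauss, Fourier, and Hilbert models of the thesis may differ, and all function names here are hypothetical.

```python
import cmath
import math

def gauss_summary(positions, doc_length):
    """Mean and standard deviation of positions, normalized to [0, 1)."""
    xs = [p / doc_length for p in positions]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, math.sqrt(var)

def fourier_coefficients(positions, doc_length, order):
    """Coefficients c_0..c_order of a sum of unit impulses at the positions.

    Assumption: each occurrence is modeled as an impulse on [0, 1), so
    c_k = sum_j exp(-2*pi*i*k*x_j).
    """
    xs = [p / doc_length for p in positions]
    return [
        sum(cmath.exp(-2j * math.pi * k * x) for x in xs)
        for k in range(order + 1)
    ]

def reconstruct(coeffs, x):
    """Evaluate the truncated real Fourier series at normalized position x."""
    value = coeffs[0].real
    for k in range(1, len(coeffs)):
        # For a real signal c_{-k} is the conjugate of c_k, hence the factor 2.
        value += 2 * (coeffs[k] * cmath.exp(2j * math.pi * k * x)).real
    return value

# Hypothetical example: a term occurring at word offsets 10, 12, 14 of 100.
positions, doc_length = [10, 12, 14], 100
mean, std = gauss_summary(positions, doc_length)
coeffs = fourier_coefficients(positions, doc_length, order=5)
near = reconstruct(coeffs, 0.12)  # inside the cluster of occurrences
far = reconstruct(coeffs, 0.62)   # far away from any occurrence
```

Both summaries have a fixed size regardless of how often the term occurs, which is what makes representations of this kind attractive for storage. The reconstructed value is larger near the cluster of occurrences than far from it.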
The proposed models are experimentally evaluated using standard resources of the IR
research community (Text Retrieval Conference). All experiments demonstrate that the use
of positional information can enhance the quality of search results. The suggested models
outperform state-of-the-art retrieval utilities.
The term position models open new possibilities to analyze and handle textual data. For
instance, document clustering and compression of positional data based on these models
could be interesting topics to be considered in future research.