lavrenko-tutorial-LM-SIGIR03

Language Modeling in IR
Tutorial, SIGIR 2003
Victor Lavrenko
Department of Computer Science,
University of Massachusetts, Amherst
lavrenko@cs.umass.edu
© Victor Lavrenko, Jul. 27, 2003
What I hope to Accomplish
• Present a general Language Modeling framework
• Discuss different methods of estimation
• Discuss a few applications of LMs
Outline of the Tutorial
• Introduction to Language Models
– what is a language model?
– how can we use language models?
– what are the major issues in language modeling?
• Estimation of Language Models
– Basic Models
– Translation Models
– Aspect Models
– Non-parametric Models
• Bayesian framework for estimation of LMs
• Case study: Relevance Models
What is a Language Model?
• Probability distribution over strings of text
– how likely is a given string (observation) in a given “language”
– for example, consider the probabilities of the following four strings
– English: p_1 > p_2 > p_3 > p_4
p_1 = P(“a quick brown dog”)
p_2 = P(“dog quick a brown”)
p_3 = P(“быстрая brown dog”)
p_4 = P(“быстрая собака”)
• … depends on what “language” we are modeling
– in most of this tutorial we will have p_1 == p_2
– for some applications we will want p_3 to be highly probable
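One way to read p_1 == p_2 is that the model ignores word order. The sketch below assumes a unigram model (independent words) estimated from a made-up toy corpus; the slide does not name the model, so both the model choice and the corpus are illustrative assumptions.

```python
import math
from collections import Counter
from fractions import Fraction

# Toy "English" sample; invented for illustration, not from the tutorial.
corpus = "a quick brown dog saw a quick brown fox and a lazy dog".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(text):
    """P(text) under a unigram model: product of per-word relative frequencies."""
    # Fractions keep the arithmetic exact, so order of multiplication cannot matter.
    return math.prod(Fraction(counts[w], total) for w in text.split())

p1 = unigram_prob("a quick brown dog")
p2 = unigram_prob("dog quick a brown")
p3 = unigram_prob("быстрая brown dog")

print(p1 == p2)  # same words, different order
print(p3)        # contains a word never seen in the toy corpus
```

Because no smoothing is applied, any string containing an unseen word gets probability zero here, so this toy model cannot separate p_3 from p_4.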
Language Modeling Notation
• Convenient to make explicit what we are modeling:
M … “language” we are trying to model
s … observation (string of tokens from some vocabulary)
P(s|M) … probability of observing “s” in language M
• M can be thought of as a “source” or a generator
– a mechanism that can spit out strings that are legal in the language
P(s|M) … probability of getting “s” during random sampling from M
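The "generator" view can be sketched directly: sample many strings from M and count how often "s" comes up. The model below is a hypothetical three-word unigram distribution, and strings are sampled at a fixed length equal to that of "s"; both are simplifying assumptions made only for this illustration.

```python
import random

# Hypothetical model M: a unigram distribution over a tiny vocabulary.
M = {"tropical": 0.5, "storm": 0.3, "gulf": 0.2}

def sample_string(model, length, rng):
    """Draw `length` tokens independently from the model and join them."""
    words = list(model)
    weights = [model[w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=length))

def estimate_prob(model, s, n_samples=100_000, seed=0):
    """Estimate P(s|M) as the fraction of sampled strings that equal s."""
    rng = random.Random(seed)
    length = len(s.split())
    hits = sum(sample_string(model, length, rng) == s for _ in range(n_samples))
    return hits / n_samples

# Exact value under the independence assumption, for comparison.
exact = M["tropical"] * M["storm"]
approx = estimate_prob(M, "tropical storm")
print(exact, approx)
```

The sampled estimate should approach the exact product as the number of samples grows, which is the sense in which P(s|M) is the probability of getting "s" during random sampling from M.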
How can we use LMs in IR?
• Task: given a query, retrieve relevant documents
• Use LMs to model the process of query generation:
– user thinks of some relevant document
– picks some keywords to use as the query
[Figure: a relevant document (“Forecasters are watching two tropical storms that could pose hurricane threats to the southern United States. One is a downgraded …”), candidate keywords (wind, weather, hurricane, tropical, forecast, gulf, storm), and the query “tropical storms”]
Language Modeling for IR
• Every document in a collection defines a “language”
– consider all possible sentences (strings) that author could have
written down when creating some given document
– some are perhaps more likely to occur than others
• subject to topic, writing style, language …
– P(s|M_D) … probability that author would write down string “s”
• think of writing a billion variations of a document and counting how many times we get “s”
• Now suppose “q” is the user’s query
– what is the probability that author would write down “q”?
• Rank documents D in the collection by P(q|M_D) [1]
– probability of observing “q” during random sampling from the
language model of document D
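The ranking rule can be sketched as follows. This is a minimal illustration, assuming a unigram model per document with additive smoothing; the smoothing method, the constant `alpha`, and the toy collection are assumptions introduced here, not something this slide specifies.

```python
import math
from collections import Counter

# Toy collection; document ids and texts are made up for illustration.
docs = {
    "d1": "forecasters are watching two tropical storms in the gulf",
    "d2": "the stock market fell sharply on monday",
    "d3": "a tropical hurricane brought wind and rain",
}

vocab = {w for text in docs.values() for w in text.split()}

def log_query_likelihood(query, text, alpha=0.1):
    """log P(q|M_D) under a unigram document model with additive smoothing."""
    counts = Counter(text.split())
    # One extra slot reserves probability mass for words outside the vocabulary.
    denom = sum(counts.values()) + alpha * (len(vocab) + 1)
    return sum(math.log((counts[w] + alpha) / denom) for w in query.split())

def rank(query):
    """Return document ids sorted by decreasing log P(q|M_D)."""
    return sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]),
                  reverse=True)

ranking = rank("tropical storms")
print(ranking)
```

Working in log space avoids underflow for longer queries and does not change the ranking, since the logarithm is monotonic. Without some form of smoothing, any document missing a single query word would score P(q|M_D) = 0.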
Other applications: same idea
• Topic Detection and Tracking
– query “q” can be a topic description, or an on-topic story
– documents with high P(q|M_D) probably discuss the same topic
• Classification / Filtering
– query can be a set of training documents for a particular class
– or testing docs can reflect observations from model of training set
• Cross-language Retrieval
– query can be in a different language from document collection
– author could have written a document in a different language
• Multi-media Retrieval
– languages don’t have to be textual (e.g. spoken or handwritten docs)
– extends to images, sounds, video, preferences, hyperlinks, …
Is _____ a LM technique?
• How do we determine if a given model is a LM?
• LM is generative
– at some level, a language model can be used to generate text
– explicitly computes probability of observing a string of text
– Ex: probability of observing a query string from a document model
probability of observing an answer from a question model
– model an entire population
• Discriminative approaches
– model just the decision boundary
– Ex: is this document relevant?
does it belong to class X or Y?
– have a lot of advantages, but these are not generative approaches
Language Modeling: pros & cons
• Pros:
– formal mathematical model
– simple, well-understood framework
– integrates both indexing and retrieval models
– natural use of collection statistics, no heuristics
– avoids tricky issues of “relevance”, “aboutness”, etc.
• Cons:
– difficult to incorporate notions of “relevance”, user preferences
– relevance feedback / query expansion not straightforward
– can’t accommodate phrases, passages, Boolean operators
• Extensions of LM overcome some issues
– Probabilistic LSI, Relevance Models, etc.