‡

‡

‡

Association and Scalar Clusters Tutorial – Part 1:

Back Mapping Term Clusters to Documents

Dr. E. Garcia

admin@miislita.com

First Published: January 18, 2008. Last Modified: March 9, 2008.

Copyright ? Dr. E. Garcia, 2008. All Rights Reserved. http://www.miislita.com

Abstract

In this tutorial you will learn how to extract association and scalar clusters from a term-document matrix. A

“reaction” equation approach is used to break down the classification problem to a sequence of steps. From the

initial matrix, two similarity matrices are constructed, and from these association and scalar clusters are identified.

A back mapping technique is then used to classify documents based on their degree of pertinence to the clusters.

Matched documents are treated as distributions over topics. Applications to topic discovery, term disambiguation,

and document classification are discussed.

Keywords: association, scalar, clusters, back mapping, topic discovery, similarity matrix, document classification

Background

This tutorial is based on the Association and Scalar Clusters lecture given to graduate students enrolled in Web

Mining, Search Engines, and Business Intelligence (http://irthoughts.wordpress.com/category/web-mining-course/),

Polytechnic University of Puerto Rico (http://www.pupr.edu). Some portions are taken from a test given to the

students and from the January issue of IR Watch – The Newsletter (1, 2008).

Linear algebra calculations are represented as a sequence of “chemical reactions”. In my experience, it is easier for

students to solve linear algebra problems if they can visualize these as stepwise processes. Using reaction

equations requires of less talking, redundant explanations, and helps students organize their own thoughts and

lecture notes. Last, but not least, I like the approach since I also have a background in chemistry.

Before delving into the tutorial we need to explain the difference between association and scalar clusters and define

what is a similarity matrix M. While association clusters are based on comparing nearest neighbors (pairwise

similarity), scalar clusters are based on comparing neighborhoods (neighborhood-induced similarity). We denote

the corresponding similarity matrices by M and M . nn n

In general, a similarity matrix M is any array of the form

Figure 1. A similarity matrix M is a special type of array.

Note that M is an array with the following salient properties:

Equidimensionality # rows = # columns = trace

Non-Negativity: m 0 ij

Reflexivity: m = 1 for all i = j ij

Symmetry: m = m ij ji

Boundedness: 1 m 0 ij

With this in mind, let proceed with the tutorial.

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 1 Problem

A user searches for surfing in a search engine and constructs a document collection from the top five ranked

documents. The following procedure is used.

Firstly, titles are treated as “documents” and indexed. To index the “documents” these undergo linearization,

tokenization, and filtration. Stemming is not applied. A term-document matrix A is constructed from survival terms.

To simplify the discussion, a elements of A are computed using FREQ, loosely known as the term count model, ij

a = L = f (Eq 1) ij ij ij

where

L is the weight of term i in document j. ij

f is the occurrence frequency of term i in document j. ij

Global and normalization weights are not used, but we could include these as well. See Figure 2.

Figure 2. This is a term-document matrix.

The problem now reduces to classifying terms into association and scalar clusters and identifying which documents

are more pertinent to the clusters. However, another part of the problem is how to deal with terms that are

ambiguous. For instance, surfing seems to be mentioned within different contexts across the collection. Should we

include surfing with clusters that mention internet and/or web, or with clusters that mention hawaii and/or beach?

This is a disambiguation problem.

Solution

Firstly, we reduce the problem to the following “reaction” equations:

T

A A Transpose of A

T

AA B Co-Weight Matrix

B C = M Similarity Matrix (nearest neighbor) nn

C D Row-Normalized Matrix

T

D D Transpose of D

T

DD E = M Similarity Matrix (neighborhood) n

C is a similarity matrix populated with Jaccard’s Coefficients. These coefficients estimate the degree of similarity

between pair of neighbors. Since pairs of nearest neighbors are used to generate the clusters, we could refer to C

as a nearest neighbor similarity matrix. We can denote this by stating that C is an M -type matrix; i.e. C = M . nn nn

By contrast, E is a similarity matrix populated with cosine angles (i.e., cosine similarities). Essentially we treat a row

i populated with Jaccard’s Coefficients as the neighborhood of term i. Then, rows are restated as unit vectors and

their Dot Products taken. Since for unit vectors cosine angles equal Dot Products, the result is a matrix of cosine

similarities. E can be called a neighborhood similarity. We denote this by stating that E is an M -type matrix; i.e. E = n

M . n

Clusters are then generated from C and E and these are mapped back to documents. Back mapping is aimed at

identifying in which documents all members of a given cluster co-occur. These documents should be more relevant

to a cluster than any other documents of the collection. The procedure is straightforward and consists of the

following steps.

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 2 Step 1: Compute the transpose of A by changing rows into columns and columns into rows.

T

A A

T

Figure 3. Construction of A .

T

Step 2: Compute the Co-Weight matrix B by premultiplying A by A.

T

A A B

Figure 4. Construction of B.

Step 3: Compute matrix C by transforming co-weights into pairwise similarities. Use Jaccard’s Coefficients.

B C

Figure 5. Construction of C with Jaccard’s Coefficients.

It is easy to show that

J = 3/(3 + 3 – 3) = 1 11

J = 2/(3 + 2 – 2) = 0.67 12

J = 2/(3 + 4 – 2) = 0.40 13

and so forth.

We can recognize C as a similarity matrix.

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 3 Step 4: Transform C into a row-normalized matrix D by converting row vectors into unit vectors.

C D

Figure 6. Row-normalized matrix D.

Step 5: Compute the transpose of D by changing rows into columns and columns into rows.

T

D D

T

Figure 7. Computing the transpose matrix D .

Step 6: Compute matrix E. Since unit vectors are used, Dot Products = Cosine similarity values.

T

D D E

Figure 8. Construction of matrix E, populated with Dot Products equal to cosine similarity values.

If we ignore the small deviations (± 0.01) due to rounding errors, we can recognize E as a similarity matrix.

Extracting Association Clusters

As mentioned, we extract association clusters (ACs) with M and scalar clusters (SCs) with M . To accomplish nn n

this, we treat each row i of C as the neighborhood of term i. Thus, elements of row i can be related to the neighbors

of term i. Our goal is then two-fold: identify the nearest neighbor(s) of term i and compare any two neighborhoods.

The assumptions here are that any two terms are similar to one another if they are nearest neighbors or if they

have similar neighborhoods.

Since row i of C represents the neighborhood of term i, for a given row, the nearest neighbor of term i is a term

other than itself with the largest similarity value. However, it might happen that a term has more than one nearest

neighbor. In that case, we include these in the cluster. For example, from row 5 of C (see Figure 5) the nearest

neighbors of t5 are t3 and t4. These terms conform the AC5 = {t5, t3, t4} association cluster.

Repeating the analysis for all rows,

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 4 t1 t2 AC1 = {t1, t2} = {internet, web}

t2 t1 AC2 = {t2, t1} = {web, internet}

t3 t5 AC3 = {t3, t5} = {surfing, beach}

t4 t5 AC4 = {t4, t5} = {hawaii, beach}

t5 t3 and t4 AC5 = {t5, t3, t4} = {beach, surfing, hawaii}

Note that row 1 gives the AC1 = {t1, t2} cluster and row 2 gives the AC2 = {t2, t1}. If order does not matter,

AC1 = AC2

and we end with just four association clusters:

AC1 = AC2 = AC1 = {t1, t2}

AC3 = {t3, t5}

AC4 = {t4, t5}

AC5 = {t5, t3, t4}

From the limited collection examined, only these association clusters are extractable from A. One might be tempted

to invoke co-occurrence arguments and propose other combination of terms, but the corresponding association

clusters are not extractable from A.

From AC3 and AC4 is clear that t3 (surfing) and t4 (hawaii) have a common nearest neighbor: t5 (beach). Across

the collection, surfing is used more in the context of beach and/or hawaii than in the context of internet and/or web.

Extracting Scalar Clusters

Scalar clusters are extracted using a similar procedure.

For example, from row 1 of E (see Figure 8) the most similar neighborhood of t1 is the neighborhood of t2. Thus,

row 1 gives the SC1 = {t1, t2} cluster.

Likewise, from row 2 of E the most similar neighborhood of t2 is the neighborhood of t1, and row 2 gives the SC2 =

{t2, t1}. Repeating the analysis for all rows

t1 t2 SC1 = {t1, t2} = {internet, web}

t2 t1 SC2 = {t2, t1} = {web, internet}

t3 t5 SC3 = {t3, t5} = {surfing, beach}

t4 t5 SC4 = {t4, t5} = {hawaii, beach}

t5 t4 SC5 = {t5, t4} = {beach, hawaii}

If order does not matter,

SC1 = SC2 = SC1 = {t1, t2}

SC3 = {t3, t5}

SC4 = SC5 = {t4, t5}

The following comparative, between association and scalar clusters, is enlightening:

Association Clusters Scalar Clusters

AC1 = AC2 = AC1 = {t1, t2} SC1 = SC2 = SC1 = {t1, t2}

AC3 = {t3, t5} SC3 = {t3, t5}

AC4 = {t4, t5} SC4 = SC5 = {t4, t5}

AC5 = {t5, t3, t4}

In AC5, t5 has two nearest neighbors, t3 and t4. But, when we include neighborhood-induced similarity the net

effect is to eliminate this degeneration. Neighborhood-induced similarity has a clear discriminatory power.

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 5 Examining Neighborhood-Induced Similarity

A side-by-side comparison between M = E and M = C reveals several salient features: n nn

• All non i = j entries gain some similarity. The largest gain (0.31 units) occurs between t4 and t5.

• t1 and t2 exhibits some similarity with t4 and t5 when there was none.

• In AC5, t5 has two nearest neighbors, t3 and t4. But, when we include neighborhood-induced similarity and

calculate the scalar clusters this degeneration vanishes. Neighborhood-induced similarity has a clear

discriminatory power.

These similarity changes are examples of what we call neighborhood-induced similarity effects.

We can roughly inspect this effect by deducting M from M . Some might argue against this treatment since after n nn

all the two matrices are populated with different similarity measures (Jaccard’s Coefficients vs cosine similarities)..

A valid way of detecting the presence of neighborhood-induced similarity consists in representing the data in a

similarity space. In this manner the neighborhood of a term or a set of clusters with association and scalar

counterparts can be inspected. This is illustrated in Figure 9 for the ij pairs formed with t3 and for the corresponding

clusters.

Figure 9. Demonstration of neighborhood-induced similarity.

The high correlation coefficients indicate that the effect is not an artifact, but real. The overall trend is to gain

similarity. Lost of similarity is also possible, but not in this case.

Back Mapping Term Clusters to Documents

We are now in position of answering a cardinal question: Which documents are more relevant to the clusters?

To address this question we resource to what I call the back mapping technique. Essentially, we have constructed

clusters of features (terms) from objects (documents). In the back mapping technique we map these clusters of

features back to the objects and generate clusters of objects.

What we need to do is to look at the original term-document matrix and find in which documents all members of a

given cluster co-occur. In other words, need to find which documents are common to all members of a cluster of

terms. The identified documents conform the back mapped clusters. The assumption here is that these documents

are more pertinent to the term clusters than any other documents of the collection.

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 6 Figure 10 depicts back mapping results for the generated association and scalar clusters. Note that d1 is common

to all members of AC1 and SC1. The same occurs with d3. Likewise, d4 and d5 are common to all members of

AC3 and SC3. In the case of d4, this document is also common to all members of AC4 and SC4.

Association Clusters Documents Scalar Clusters Documents

AC1 = AC2 = {t1, t2} d1, d3 SC1 = SC2 = {t1, t2} d1, d3

AC3 = {t3, t5} d4, d5 SC3 = {t3, t5} d4, d5

AC4 = {t4, t5} d4 SC4 = {t4, t5} d4

AC5 = {t5, t3, t4} d4

Figure 10. Back Mapping term clusters to documents.

.

These results are remarkable, for the following reasons.

If term clusters are taken for topics across the collection, back mapping allows us to treat documents as

distributions over such topics. Note from A that internet and surfing co-occur in d2, but this document cannot be

mapped back to any of the clusters. In this case, internet surfing is not a topic across the collection.

The case of d4 is equally interesting. This document describes a distribution over three topics coming from the

association clusters: {t3, t5}, {t4, t5}, and {t5, t3, t4} –with beach as common term. When we consider neighborhood

effects (scalar clusters), d4 describes a distribution over two topics, only: {t3, t5} and {t4, t5} –again, with beach as

common term.

Masking Effects in Similarity Matrices

From a probabilistic standpoint, a document is a distribution over topics whereas a topic is a distribution over

words. A topic can also be modeled as a cluster of terms, whose similarity expresses the strength of the topic (or

“tightness” of the cluster).

Such conceptualization promotes the notion of “strong” and “weak” topics (or “tight” and “loose” clusters). For

instance in our example, {t1, t2} = internet web is the strongest topic across the collection, exhibiting a similarity of

0.67 as an association cluster and 0.92 as a scalar cluster.

One would expect the presence of strong topics to somehow mask the presence of weak topics, making these

“invisible” to the clustering process.

To illustrate, consider Figure 11 and the t1, t3 pair co-occurring in d1 and d2. Again, assume that word order does

not matter; thus, {t3, t1} = {t1, t3}. A close look at rows 1 and 3 of the C and E similarity matrices reveals that the

similarity of this topic is surpassed by the similarities of the {t1, t2} and {t3, t5} topics.

Figure 11. Masking effects in similarity matrices:

Topics with strong similarities can mask other topics,

making these invisible to the clustering process.

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 7 The strength order is:

Row 1, C: {t1, t2} 0.67 > {t1, t3} 0.4 Row 1, E: {t1, t2} 0.92 > {t1, t3} 0.60

Row 3, C: {t3, t5} 0.5 > {t3, t1} 0.4 Row 3, E: {t3, t5} 0.75 > {t3, t1} 0.60

When we construct the association and scalar clusters the {t1, t3} topic is at a distant second place, ignored, and

“invisible” to the process. The topic is effectively masked. These results explain why the {t1, t3} cluster is not

extractable from the collection and why we cannot match d2 to any of the clusters already extracted. The masking

of the {t2, t3} = web surfing and {t3, t4} = surfing hawaii topics can be rationalized using similar arguments.

Some Practical Applications

We have shown how to extract clusters from any term-document matrix and, from these, how to identify

combination of keywords. It is time to describe some possible applications.

The back mapping technique allows us to identify which documents are more pertinent to the clusters and whether

a given document describes a distribution over specific topics. Such topics are latent across the collection of

documents. Thus, the technique can be applied to the construction of topic-specific directories.

Another application of cluster analysis is in the area of pruning. From matrix A, it is clear that d1 mentions t2 (web)

and t3 (surfing), but the {t2, t3} cluster is not extractable from A. Like internet surfing, web surfing is not a topic

across the collection. We can prune these candidate clusters.

We still have a third application. This is in the area of search engine optimization and marketing (SEO/SEM).

As part of document optimization or keyword bidding programs, some SEO and SEM specialists arbitrarily permute

terms, hoping to optimize documents and ads for all possible combination of keywords. This “optimization strategy”,

which is just a “covering all the bases” approach, can be both risky and futile. Currently, many companies spend a

considerable budget on equally questionable “optimization strategies”.

Suppose that instead of a collection of documents from the top 5 ranked documents we deal with a collection of ads

from different competitive firms. The goal is to wisely bid on specific combinations of terms. Rather than boldly bid

on all possible combination and burning a budget, we could resource to cluster analysis and arrive to educated

decisions.

Additional Work and Future Areas of Research

The analysis herein described can be refined a bit further by clustering terms based on whether these are within a

threshold interval of similarity. Such analysis should help to identify overlapping clusters.

Back mapping is a simple technique that elegantly captures patterns that might be latent (hidden) in a term-

document matrix. The reverse classification problem, that is, classifying documents into association and scalar

clusters and mapping these back to terms is also possible. A comparative analysis consisting in mapping both ways

documents and terms should shed more light into how terms, documents, and clusters are related across a

collection. In Part 2 of this tutorial I explain how this is done.

Next: Association and Scalar Clusters Tutorial – Part 2: Back Mapping Document Clusters to Terms

Exercise

1. Repeat this tutorial exercise, but this time assuming that d1 and d2 repeat surfing three times each.

Assume also that matrix A is populated with the following term weight model, where:

a = L G N ij ij i j

L = 1 + log f ij ij

G = IDF = log(D/d ) i i i

D = collection size

d = number of documents containing term i i

N = 1 j

Copyright © 2008 Dr. E. Garcia http://www.miislita.com 8