Latent Semantic Analysis Tutorial

I wrote a tutorial on Latent Semantic Analysis (LSA). It can be accessed by following this link. I believe LSA is a very interesting method for ranking documents in response to a query.

LSA is a method for discovering hidden concepts in document data. Each document and term (word) is expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept.

The goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way that exposes document-document, document-term, and term-term similarities or semantic relationships which are otherwise hidden.
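
In matrix terms, this representation typically comes from a truncated singular value decomposition (SVD). As a minimal sketch of the setup, in my notation rather than necessarily the tutorial's: if A is the t × d term-document matrix (t terms, d documents), its rank-k SVD is

A ≈ A_k = U_k Σ_k V_k^T,

where the rows of U_k are the term vectors, the rows of V_k are the document vectors, and the k columns correspond to the hidden concepts.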

An Example

Suppose we have the following set of five documents:

d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.

and the search query: dies, dagger.

A classical IR system would rank d3 at the top of the list, since it contains both query words, dies and dagger. Then d2 and d4 would follow, each containing one word of the query.

However, what about d1 and d5? Should they be returned as possibly interesting results to this query? A classical IR system will not return them at all. However, as humans we know that d1 is quite related to the query, whereas d5 is not so much related to it. Thus, we would like to get d1 but not d5; put differently, we want d1 to be ranked higher than d5.

The question is: can the machine deduce this? The answer is yes; LSA does exactly that. In this example, LSA will be able to see that the term dagger is related to d1 because it occurs together with d1's terms Juliet and Romeo, in d2 and d3, respectively.

Also, the term dies is related to d1 and d5 because it occurs together with d1's term Romeo and d5's term New-Hampshire, in d3 and d4, respectively.

LSA will also weigh the discovered connections properly: d1 is more related to the query than d5, since d1 is "doubly" connected to dagger through Romeo and Juliet, and also connected to die through Romeo, whereas d5 has only a single connection to the query, through New-Hampshire.
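
For readers who want to try this on a machine, here is a minimal sketch in Python with NumPy. The tokenization (stop words dropped by hand, died/dies stemmed to die), the 8-term vocabulary, the variable names, and the choice of k = 2 concepts are my assumptions for illustration, not necessarily the tutorial's.

import numpy as np

# The five example documents, reduced by hand to content words.
docs = {
    "d1": "romeo juliet",
    "d2": "juliet happy dagger",
    "d3": "romeo died dagger",
    "d4": "live free die new-hampshire",
    "d5": "new-hampshire new-england",
}
vocab = ["romeo", "juliet", "happy", "dagger",
         "live", "die", "free", "new-hampshire"]
stem = {"died": "die", "dies": "die"}

def counts(text):
    words = [stem.get(w, w) for w in text.split()]
    return [words.count(t) for t in vocab]

# Term-document matrix A: one row per term, one column per document.
A = np.array([counts(d) for d in docs.values()], dtype=float).T

# Truncated SVD, keeping k = 2 concepts.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = Vt[:k].T                     # one row per document, in concept space

# Fold the query in as a pseudo-document: q_k = q^T U_k Sigma_k^{-1}.
q = np.array(counts("dies dagger"), dtype=float)
q_vec = (q @ U[:, :k]) / s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for name, v in zip(docs, doc_vecs):
    print(name, round(cosine(q_vec, v), 3))

With these assumptions, the printed cosine similarities should rank d1 well above d5, matching the discussion above.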

The theory behind LSA is that of eigenvalues, eigenvectors, and matrix decompositions. I tried to make the tutorial self-contained by reviewing some algebra facts and giving some transformations that aren't always easy to extract from thick algebra books.
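
For a taste of that connection (standard linear algebra, stated here in my notation rather than quoted from the tutorial): the SVD A = U Σ V^T is tied to eigendecompositions through

A^T A = V Σ^2 V^T  and  A A^T = U Σ^2 U^T,

so the columns of V are eigenvectors of A^T A, the columns of U are eigenvectors of A A^T, and the singular values in Σ are the square roots of the shared nonzero eigenvalues. This is also the fact behind some of the questions in the comments below.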

If you read the tutorial I would appreciate your comments and suggestions for improvement.

Comments

  1. It is a great tutorial. It helps in understanding the complex LSA with little effort. Thank you very much.
    Hatem

  2. Very well written tutorial explaining LSA, starting from the concept of eigenvalues...

    On page 2, the following equation is not correct. The '+'s should have been ','s. Am I right?

    AS = A[x1, ..., xn]
       = Ax1 + ... + Axn

    On page 5, it is mentioned that S contains the eigenvectors of B, and U the eigenvectors of C. Is it not the other way around? Please check.

    Shanmukh

  3. Thank you, Shanmukh.

    You are right, the '+'s should be ','s. We have a matrix multiplication there: x1, ..., xn are vectors (they are in bold), and they are the columns of the matrix on the right, so the product is the matrix with columns Ax1, ..., Axn.
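
    For instance, a made-up 2 x 2 check (my numbers, not from the tutorial):

    A = |1 2|    x1 = |1|    x2 = |0|
        |0 3|         |1|         |1|

    A[x1, x2] = |3 2|
                |3 3|

    and the columns are exactly Ax1 = (3, 3)^T and Ax2 = (2, 3)^T.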

    On page 5, S is an m \times m matrix and it is the matrix of eigenvectors of B.

    Alex

  4. Thank you! That is a really good tutorial! Easy to understand!

    Thank you again!

  5. Such a simple and easy to understand tutorial.

    It took me hours to understand it from my textbook, whereas it took less than an hour reading your tutorial.

    Thanks :D

  6. Hi Sir,

    Page 2 says

    In other words, for symmetric matrices, S^{-1} is S^T (which can be easily obtained) and we have A = S Λ S^T

    I was not able to get the above line.

    S is a matrix of eigenvectors, thus it cannot be a symmetric matrix. And if it is not a matrix of eigenvectors, then it has to be a symmetric matrix, and the above statement doesn't hold for all symmetric matrices (I checked it for some symmetric matrices). Just wanted to clear up the confusion, as I could not get it.

    Regards,
    Rohit Katariya

  7. Hello sir,

    I've read your tutorial and it is really helpful :)
    But I have several questions.
    1. I'm a little bit confused by section 3.2.
    It says that B = A^T * A has size m x m (in this case the size is 8x8), but after I calculated it, the size is 5x5, because when you multiply A^T * A (5x8 * 8x5), you get a 5x5 matrix.

    2. You wrote that "matrix S is obtained from the eigenvectors of A^T * A", but I read in other references that matrix S is obtained from the eigenvectors of A * A^T. Can you give me an explanation? Is that caused by m > n, or what?

    Thank you
