I wrote a tutorial on Latent Semantic Analysis (LSA). It can be accessed by following this link. I believe LSA is a very interesting method for ranking documents in response to a query.

LSA is a method for discovering hidden concepts in document data. Each document and term (word) is expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept.

The goal is not to describe the concepts verbally, but to represent the documents and terms in a unified way, exposing document-document, document-term, and term-term similarities or semantic relationships which are otherwise hidden.

An Example

Suppose we have the following set of five documents:

d1 : Romeo and Juliet.

d2 : Juliet: O happy dagger!

d3 : Romeo died by dagger.

d4 : “Live free or die”, that’s the New-Hampshire’s motto.

d5 : Did you know, New-Hampshire is in New-England.

and the search query: dies, dagger.

A classical IR system would rank d3 at the top of the list, since it contains both query terms, dies and dagger. Then d2 and d4 would follow, each containing one word of the query.

However, what about d1 and d5? Should they be returned as possibly interesting results for this query? A classical IR system will not return them at all. Yet, as humans, we know that d1 is quite related to the query, while d5 is not so much. Thus, we would like d1 but not d5; in other words, we want d1 to be ranked higher than d5.

The question is: can the machine deduce this? The answer is yes; LSA does exactly that. In this example, LSA will be able to see that the term dagger is related to d1 because it occurs together with d1's terms Romeo and Juliet, in d2 and d3 respectively.

Also, the term dies is related to d1 and d5 because it occurs together with d1's term Romeo and d5's term New-Hampshire, in d3 and d4 respectively.

LSA will also properly weigh the discovered connections: d1 is more related to the query than d5, since d1 is "doubly" connected to dagger through Romeo and Juliet, and also connected to die through Romeo, whereas d5 has only a single connection to the query, through New-Hampshire.
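To make this concrete, here is a minimal sketch of the ranking in Python with numpy (my own illustration, not code from the tutorial). It assumes a binary term-document matrix over eight content words — romeo, juliet, happy, dagger, live, die, free, new-hampshire — with died/dies stemmed to die, and uses a rank-2 SVD:

```python
import numpy as np

terms = ["romeo", "juliet", "happy", "dagger", "live", "die", "free", "new-hampshire"]
docs = {
    "d1": ["romeo", "juliet"],
    "d2": ["juliet", "happy", "dagger"],
    "d3": ["romeo", "die", "dagger"],
    "d4": ["live", "free", "die", "new-hampshire"],
    "d5": ["new-hampshire"],
}

# Binary 8x5 term-document matrix: rows are terms, columns are documents.
A = np.array([[1.0 if t in words else 0.0 for words in docs.values()] for t in terms])

# Rank-2 truncated SVD: documents and terms get 2-dimensional "concept" coordinates.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]
Dk = Vt[:k, :].T                      # rows of Dk are document coordinates

# Fold the query "die, dagger" into the same concept space.
q = np.array([1.0 if t in ("die", "dagger") else 0.0 for t in terms])
qk = (q @ Uk) / sk

# Rank documents by cosine similarity to the query in concept space.
cos = Dk @ qk / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(qk))
for name, c in zip(docs, cos):
    print(name, round(float(c), 3))
```

With these (assumed) preprocessing choices, d1 ends up ranked above d5, matching the intuition described above, even though neither document contains a query term.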

The theory behind LSA is that of eigenvalues, eigenvectors, and matrix decompositions. I tried to make the tutorial self-contained by reviewing some algebra facts and giving some transformations that aren't always easy to extract from thick algebra books.

If you read the tutorial I would appreciate your comments and suggestions for improvement.


It is a great tutorial. It helps in understanding the complexities of LSA with little effort. Thank you very much.

Hatem

Very well written tutorial explaining LSA, starting from the concept of eigenvalues.

On page 2, the following equation is not correct. The '+'s should have been ','s. Am I right?

AS = A[x1, . . . , xn]

= Ax1 + . . . + Axn

On page 5, it is mentioned that S contains the eigenvectors of B, and U the eigenvectors of C. Is it not the other way around? Please check.

Shanmukh

Thank you, Shanmukh.

The '+'s should be ','s. This is because we have a matrix multiplication there: x1, ..., xn are vectors (they are in bold), and they are the columns of the matrix on the right.

On page 5, S is an m × m matrix and it is the matrix of eigenvectors of B.

Alex
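For readers following this exchange, the column-by-column reading of the product can be checked numerically. A small sketch with numpy, using a made-up 3×3 example (not from the tutorial):

```python
import numpy as np

# Hypothetical matrices: S has columns x1, x2, x3.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 1.0]])
S = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

AS = A @ S
for i in range(3):
    xi = S[:, i]                          # i-th column of S
    assert np.allclose(AS[:, i], A @ xi)  # i-th column of AS equals A·xi
print("each column of AS is A times the corresponding column of S")
```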


Thank you! That is a really good tutorial! Easy to understand!

Thank you again!

Such a simple and easy to understand tutorial.

It took me hours to understand it from my textbook, whereas it took less than an hour reading your tutorial.

Thanks :D

Hi Sir,

Page 2 says:

In other words, for symmetric matrices, S^-1 is S^T (which can be easily obtained), and we have A = S Λ S^T.

I was not able to get the above line.

S is a matrix of eigenvectors, thus it cannot be a symmetric matrix. And if it is not a matrix of eigenvectors, then it has to be a symmetric matrix, and the above statement doesn't hold for all symmetric matrices (I checked it for some symmetric matrices). I just wanted to clear up the confusion, as I could not get it.

Regards,

Rohit Katariya
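One way to read the claim on page 2: the symmetry requirement is on A, not on S. When A is symmetric, its eigenvectors can be chosen orthonormal, so the matrix S holding them as columns satisfies S^-1 = S^T, even though S itself need not be symmetric. A quick numerical check with numpy, on a made-up symmetric matrix (my own example):

```python
import numpy as np

# A hypothetical symmetric matrix (not from the tutorial).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam, S = np.linalg.eigh(A)   # columns of S are orthonormal eigenvectors of A
Lam = np.diag(lam)

assert np.allclose(S @ S.T, np.eye(3))   # S is orthogonal: S^-1 = S^T
assert np.allclose(A, S @ Lam @ S.T)     # A = S Λ S^T
print("A = S Λ S^T holds with S^-1 = S^T")
```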

Hello sir..

ReplyDeleteI've read your tutorial and it is really helpful :)

But I have several questions.

1. I'm a little bit confused about section 3.2.

It says that B = At*A has size m×m (in this case, 8×8), but after I calculated it, the size is 5×5: when you multiply At*A (5×8 by 8×5), you get a 5×5 matrix.

2. You wrote "matrix S is obtained from the eigenvectors of Atranspose*A", but I read in other references that matrix S is obtained from the eigenvectors of A*Atranspose. Can you give me an explanation? Is that caused by m > n, or something else?

Thank you
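Regarding question 1, the shapes can be checked directly. A small numpy sketch (my own, using a random matrix with the example's dimensions), which also verifies that the rows of V^T are eigenvectors of At*A:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5                        # 8 terms, 5 documents, as in the example
A = rng.random((m, n))

assert (A.T @ A).shape == (n, n)   # At*A is 5x5, not 8x8
assert (A @ A.T).shape == (m, m)   # A*At is 8x8

U, s, Vt = np.linalg.svd(A)
B = A.T @ A
for i in range(n):
    # each row of Vt is an eigenvector of At*A, with eigenvalue s_i^2
    assert np.allclose(B @ Vt[i], (s[i] ** 2) * Vt[i])
print("At*A:", (A.T @ A).shape, " A*At:", (A @ A.T).shape)
```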