I wrote a tutorial on Latent Semantic Analysis (LSA). It can be accessed by following this link. I believe LSA is a very interesting method for ranking documents in response to a query.

LSA is a method for discovering hidden concepts in document data. Each document and term (word) is expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept.

The goal is not to describe the concepts verbally, but to represent the documents and terms in a unified way, exposing document-document, document-term, and term-term similarities or semantic relationships which are otherwise hidden.

An Example

Suppose we have the following set of five documents:

d1 : Romeo and Juliet.

d2 : Juliet: O happy dagger!

d3 : Romeo died by dagger.

d4 : “Live free or die”, that’s the New-Hampshire’s motto.

d5 : Did you know, New-Hampshire is in New-England.

and the search query: dies, dagger.

A classical IR system would rank d3 at the top of the list, since it contains both query terms, dies and dagger. Then d2 and d4 would follow, each containing one word of the query.

However, what about d1 and d5? Should they be returned as possibly interesting results for this query? A classical IR system will not return them at all. Yet, as humans, we know that d1 is quite related to the query, while d5 is not so much. Thus, we would like d1 but not d5; in other words, we want d1 to be ranked higher than d5.

The question is: can the machine deduce this? The answer is yes; LSA does exactly that. In this example, LSA will be able to see that the term dagger is related to d1 because it occurs together with d1's terms Romeo and Juliet, in d2 and d3 respectively.

Also, the term dies is related to d1 and d5 because it occurs together with d1's term Romeo and d5's term New-Hampshire, in d3 and d4 respectively.

LSA will also properly weigh the discovered connections: d1 is more related to the query than d5, since d1 is "doubly" connected to dagger through Romeo and Juliet, and also connected to die through Romeo, whereas d5 has only a single connection to the query, through New-Hampshire.
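To make this concrete, here is a minimal sketch of the ranking in Python with numpy (my own illustration, not code from the tutorial). It assumes a binary term-document matrix over eight content words — romeo, juliet, happy, dagger, live, die, free, new-hampshire — with died/dies stemmed to die, and uses a rank-2 SVD:

```python
import numpy as np

terms = ["romeo", "juliet", "happy", "dagger", "live", "die", "free", "new-hampshire"]
docs = {
    "d1": ["romeo", "juliet"],
    "d2": ["juliet", "happy", "dagger"],
    "d3": ["romeo", "die", "dagger"],
    "d4": ["live", "free", "die", "new-hampshire"],
    "d5": ["new-hampshire"],
}

# Binary 8x5 term-document matrix: rows are terms, columns are documents.
A = np.array([[1.0 if t in words else 0.0 for words in docs.values()] for t in terms])

# Rank-2 truncated SVD: documents and terms get 2-dimensional "concept" coordinates.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]
Dk = Vt[:k, :].T                      # rows of Dk are document coordinates

# Fold the query "die, dagger" into the same concept space.
q = np.array([1.0 if t in ("die", "dagger") else 0.0 for t in terms])
qk = (q @ Uk) / sk

# Rank documents by cosine similarity to the query in concept space.
cos = Dk @ qk / (np.linalg.norm(Dk, axis=1) * np.linalg.norm(qk))
for name, c in zip(docs, cos):
    print(name, round(float(c), 3))
```

With these (assumed) preprocessing choices, d1 ends up ranked above d5, matching the intuition described above, even though neither document contains a query term.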

The theory behind LSA is that of eigenvalues, eigenvectors, and matrix decompositions. I tried to make the tutorial self-contained by reviewing some algebra facts and giving some transformations that aren't always easy to extract from thick algebra books.

If you read the tutorial I would appreciate your comments and suggestions for improvement.


It is a great tutorial. It helps in understanding the complexities of LSA with little effort. Thank you very much.

Hatem

Very well written tutorial explaining LSA, starting from the concept of eigenvalues.

On page 2, the following equation is not correct. The '+'s should have been ','s. Am I right?

AS = A[x1, . . . , xn]

= Ax1 + . . . + Axn

On page 5, it is mentioned that S contains the eigenvectors of B, and U the eigenvectors of C. Is it not the other way around? Please check.

Shanmukh

Thank you, Shanmukh.

The '+'s should be ','s. This is because we have a matrix multiplication there: x1, ..., xn are vectors (they are in bold), and they are the columns of the matrix on the right.

On page 5, S is an m × m matrix and it is the matrix of eigenvectors of B.

Alex
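For readers following this exchange, the column-by-column reading of the product can be checked numerically. A small sketch with numpy, using a made-up 3×3 example (not from the tutorial):

```python
import numpy as np

# Hypothetical matrices: S has columns x1, x2, x3.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 1.0]])
S = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

AS = A @ S
for i in range(3):
    xi = S[:, i]                          # i-th column of S
    assert np.allclose(AS[:, i], A @ xi)  # i-th column of AS equals A·xi
print("each column of AS is A times the corresponding column of S")
```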


Thank you! That is a really good tutorial! Easy to understand!

Thank you again!

Such a simple and easy to understand tutorial.

It took me hours to understand it from my textbook, whereas it took less than an hour reading your tutorial.

Thanks :D

Hi Sir,

Page 2 says:

In other words, for symmetric matrices, S^-1 is S^T (which can be easily obtained), and we have A = S Λ S^T.

I was not able to get the above line.

S is a matrix of eigenvectors, thus it cannot be a symmetric matrix. And if it is not a matrix of eigenvectors, then it has to be a symmetric matrix, and the above statement doesn't hold for all symmetric matrices (I checked it for some symmetric matrices). I just wanted to clear up the confusion, as I could not get it.

Regards,

Rohit Katariya
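One way to read the claim on page 2: the symmetry requirement is on A, not on S. When A is symmetric, its eigenvectors can be chosen orthonormal, so the matrix S holding them as columns satisfies S^-1 = S^T, even though S itself need not be symmetric. A quick numerical check with numpy, on a made-up symmetric matrix (my own example):

```python
import numpy as np

# A hypothetical symmetric matrix (not from the tutorial).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

lam, S = np.linalg.eigh(A)   # columns of S are orthonormal eigenvectors of A
Lam = np.diag(lam)

assert np.allclose(S @ S.T, np.eye(3))   # S is orthogonal: S^-1 = S^T
assert np.allclose(A, S @ Lam @ S.T)     # A = S Λ S^T
print("A = S Λ S^T holds with S^-1 = S^T")
```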

Hello sir..

ReplyDeleteI've read your tutorial and it is really helpful :)

But I have several questions.

1. I'm a little bit confused about section 3.2.

It says that B = At*A has size m×m (in this case, 8×8), but after I calculated it, the size is 5×5: when you multiply At*A (5×8 by 8×5), you get a 5×5 matrix.

2. You wrote "matrix S is obtained from the eigenvectors of Atranspose*A", but I read in other references that matrix S is obtained from the eigenvectors of A*Atranspose. Can you give me an explanation? Is that caused by m > n, or something else?

Thank you
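Regarding question 1, the shapes can be checked directly. A small numpy sketch (my own, using a random matrix with the example's dimensions), which also verifies that the rows of V^T are eigenvectors of At*A:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5                        # 8 terms, 5 documents, as in the example
A = rng.random((m, n))

assert (A.T @ A).shape == (n, n)   # At*A is 5x5, not 8x8
assert (A @ A.T).shape == (m, m)   # A*At is 8x8

U, s, Vt = np.linalg.svd(A)
B = A.T @ A
for i in range(n):
    # each row of Vt is an eigenvector of At*A, with eigenvalue s_i^2
    assert np.allclose(B @ Vt[i], (s[i] ** 2) * Vt[i])
print("At*A:", (A.T @ A).shape, " A*At:", (A @ A.T).shape)
```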