Skip to main content

Latent Semantic Analysis Tutorial

I wrote a tutorial on Latent Semantic Analysis (LSA). It can be accessed by following this link. I believe LSA is a very interesting method for ranking documents in response to a query.

LSA is a method for discovering hidden concepts in document data. Each document and term (word) is expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept.

The goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way for exposing document-document, document-term, and term-term similarities or semantic relationship which are otherwise hidden.

An Example

Suppose we have the following set of five documents

d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.

and search query: dies, dagger.

A classical IR system would rank d3 to be the top of the list since it contains both dies, dagger. Then, d2 and d4 would follow, each containing a word of the query.

However, what about d1 and d5? Should they be returned as possibly interesting results to this query? A classical IR system will not return them at all. However (as humans) we know that d1 is quite related to the query. On the other hand, d5 is not so much related to the query. Thus, we would like d1 but not d5, or differently said, we want d1 to be ranked higher than d5.

The question is: Can the machine deduce this? The answer is yes, LSA does exactly that. In this example, LSA will be able to see that term dagger is related to d1 because it occurs together with the d1’s terms Romeo and Juliet, in d2 and d3, respectively.

Also, term dies is related to d1 and d5 because it occurs together with the d1’s term Romeo and d5’s term New-Hampshire in d3 and d4, respectively.

LSA will also weigh properly the discovered connections; d1 more is related to the query than d5 since d1 is “doubly” connected to dagger through Romeo and Juliet, and also connected to die through Romeo, whereas d5 has only a single connection to the query through New-Hampshire.

The theory behind LSA is that of eigenvalues, eigenvectors, and matrix decompositions. I tried to make the tutorial self-contained by reviewing some algrebra facts and giving some transformations that aren't always easy to extract from thick algebra books.

If you read the tutorial I would appreciate your comments and suggestions for improvement.

Comments

  1. It is a great tutorial. It helps in understanding the complex LSA in a litte effort. Thank you very much.
    Hatem

    ReplyDelete
  2. Very well written tutorial explaining LSA starting from the concept of eigen values...

    On page 2, the following equation is not correct. The '+'s should have been ','s. Am I right?

    AS = A[x1, . . . , xn]
    = Ax1 + . . . + Axn

    On page 5, it is mentioned that S contains eigen vectors of B, and U the eigen vectors of C. Is it not the other way? please check.

    Shanmukh

    ReplyDelete
  3. Thank you, Shanmukh.

    The '+' should be '+'. This is because we have a matrix multiplication there. x1, ..., xn are vectors (they are in bold). They are the columns of the matrix on the right.

    On page 5, S is an m \times m matrix and it is the matrix of eigenvectors of B.

    Alex

    ReplyDelete
  4. Thank you! That is a really good tutorial! Easy to understand!

    Thank you again!

    ReplyDelete
  5. Such a simple and easy to understand tutorial.

    It took me hours to understand from my text book, whereas only took less than an hour reading from your tutorial.

    Thanks :D

    ReplyDelete
  6. Hi Sir,

    Page 2 says

    In other words, for symmetric matrices, S −1 is S T(which can be easily obtained) and we have A=SΛST





    I was not able to get the above line.

    S is an eigen vector thus it cannot be a symmetric matrix. And if it is not eigen vector then it has to be a symmetric matrix and the above statement doesn’t hold for all symmetric matrix(checked it for some symetric matrices) . Just wanted to clear the confusion as I could not get it.

    Regards,
    Rohit Katariya

    ReplyDelete
  7. Hello sir..

    I've read your tutorial and it is really helpful :)
    But I have several question..
    1. i'm lilbit confused at section 3.2
    It said that B=At*A have size mxm (in this case the size is 8x8), but after i calculated, the size is 5x5. Because, when u multiplying At*A (5x8*8x5), you will get 5x5 matrix size.

    2. You wrotre "matrix S is obtained from eigenvector Atranspose*A" but I read from another references that matrix S is obtained from eigenvector A*Atranspose. Can you give me explanation? Is that caused by m>n or what?

    Thankyou

    ReplyDelete

Post a Comment

Popular posts from this blog

Sanctuary Sneakers

Sanctuary Sneakers is a local early-stage startup that aggregates and leverages all available sneaker data in one place to compare prices and reveal the best deals. Their web app digs up real-time prices, inventory, and more information for both new & used sneakers from the most popular online marketplaces and stores so you don't have to! It is a perfect example of data integration from different disparate data sources. Check them out at https://sanctuarysneakers.com

SQLite Musicbrainz database

Here is a rar file with a stripped down SQLite Musicbrainz database. The compressed file is about 700 MB, so it will take around 10 min to download with a good connection. This database is for educational purposes. I have not deleted tuples from the main tables, just removed some columns and tables I thought weren't necessary to experience and understand the data. Here is a database schema diagram  and a file with create table statements and explanatory comments. Also here is a Toad data model . The raw data used for the database was downloaded from Musicbrainz on Oct 14, 2012.