Andrei Lopatenko send me a note about this new book on large scale text processing with MapReduce:
http://www.umiacs.umd.edu/~jimmylin/book.html
Sunday, May 2, 2010
Saturday, March 28, 2009
Latent Semantic Analysis Tutorial
I wrote a tutorial on Latent Semantic Analysis (LSA). It can be accessed by following this link. I believe LSA is a very interesting method for ranking documents in response to a query.
LSA is a method for discovering hidden concepts in document data. Each document and term (word) is expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept.
The goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way for exposing document-document, document-term, and term-term similarities or semantic relationship which are otherwise hidden.
An Example
Suppose we have the following set of five documents
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.
and search query: dies, dagger.
A classical IR system would rank d3 to be the top of the list since it contains both dies, dagger. Then, d2 and d4 would follow, each containing a word of the query.
However, what about d1 and d5? Should they be returned as possibly interesting results to this query? A classical IR system will not return them at all. However (as humans) we know that d1 is quite related to the query. On the other hand, d5 is not so much related to the query. Thus, we would like d1 but not d5, or differently said, we want d1 to be ranked higher than d5.
The question is: Can the machine deduce this? The answer is yes, LSA does exactly that. In this example, LSA will be able to see that term dagger is related to d1 because it occurs together with the d1’s terms Romeo and Juliet, in d2 and d3, respectively.
Also, term dies is related to d1 and d5 because it occurs together with the d1’s term Romeo and d5’s term New-Hampshire in d3 and d4, respectively.
LSA will also weigh properly the discovered connections; d1 more is related to the query than d5 since d1 is “doubly” connected to dagger through Romeo and Juliet, and also connected to die through Romeo, whereas d5 has only a single connection to the query through New-Hampshire.
The theory behind LSA is that of eigenvalues, eigenvectors, and matrix decompositions. I tried to make the tutorial self-contained by reviewing some algrebra facts and giving some transformations that aren't always easy to extract from thick algebra books.
If you read the tutorial I would appreciate your comments and suggestions for improvement.
LSA is a method for discovering hidden concepts in document data. Each document and term (word) is expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept.
The goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way for exposing document-document, document-term, and term-term similarities or semantic relationship which are otherwise hidden.
An Example
Suppose we have the following set of five documents
d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.
and search query: dies, dagger.
A classical IR system would rank d3 to be the top of the list since it contains both dies, dagger. Then, d2 and d4 would follow, each containing a word of the query.
However, what about d1 and d5? Should they be returned as possibly interesting results to this query? A classical IR system will not return them at all. However (as humans) we know that d1 is quite related to the query. On the other hand, d5 is not so much related to the query. Thus, we would like d1 but not d5, or differently said, we want d1 to be ranked higher than d5.
The question is: Can the machine deduce this? The answer is yes, LSA does exactly that. In this example, LSA will be able to see that term dagger is related to d1 because it occurs together with the d1’s terms Romeo and Juliet, in d2 and d3, respectively.
Also, term dies is related to d1 and d5 because it occurs together with the d1’s term Romeo and d5’s term New-Hampshire in d3 and d4, respectively.
LSA will also weigh properly the discovered connections; d1 more is related to the query than d5 since d1 is “doubly” connected to dagger through Romeo and Juliet, and also connected to die through Romeo, whereas d5 has only a single connection to the query through New-Hampshire.
The theory behind LSA is that of eigenvalues, eigenvectors, and matrix decompositions. I tried to make the tutorial self-contained by reviewing some algrebra facts and giving some transformations that aren't always easy to extract from thick algebra books.
If you read the tutorial I would appreciate your comments and suggestions for improvement.
Tuesday, December 30, 2008
MapReduce
Google's MapReduce is a new parallelism framework for processing large amounts of data. Some recommended links are:
- Google's MapReduce Mini Lecture Series .
- Chapter 20.2 of the new (2nd) edition of "Database Systems: The Complete Book" by H. Garcia-Molina, J. D. Ullman, and J. Widom. (My slides based on this chapter).
- MIT MapReduce course.
- University of Washington course: Problem Solving on Large Scale Clusters (contains four interesting labs).
- Hongfei Yan's Mass Data Processing/Cloud Computing course resource links.
Subscribe to:
Posts (Atom)
