LSA is basically an SVD on the term-document occurrence matrix X.
X = U S V
X: N x d (N terms, d documents)
X can be approximated by taking the top K columns of the U matrix (term vectors), the top K rows and columns of the diagonal S matrix (singular values), and the top K rows of the V matrix (document vectors):
U: N x k, with orthonormal columns, i.e. U’U = I
S: k x k, diagonal matrix of the top K singular values
V: k x d, with orthonormal rows, i.e. VV’ = I
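As a concrete illustration, here is a minimal numpy sketch of this rank-K truncation; the matrix X and the choice K = 2 are placeholders, and numpy is just one convenient tool choice. Note that numpy’s svd already returns V with orthonormal rows, matching the k x d convention used above.

```python
import numpy as np

# Placeholder term-document matrix X (N terms x d documents);
# the example matrix in the data section below could be used instead.
X = np.random.rand(12, 9)
K = 2  # number of latent dimensions to keep (illustrative choice)

# Thin SVD: Vt already has orthonormal rows (the k x d convention above).
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=False)

# Rank-K truncation: top K columns of U, top K singular values, top K rows of V.
U = U_full[:, :K]          # N x K
S = np.diag(s_full[:K])    # K x K
V = Vt_full[:K, :]         # K x d

# Orthonormality of the kept singular vectors: U'U = I and VV' = I (both K x K).
assert np.allclose(U.T @ U, np.eye(K))
assert np.allclose(V @ V.T, np.eye(K))
```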
Three similarities can be computed:
1) term-term similarity, which can be understood as computing a correlation between all terms, where each term is represented by its vector of document occurrences, e.g. which set of docs contains the word “computer” (all three similarities are shown in the sketch after this list):
= X * X’ = U S V * V’ S’ U’ = U S^2 U’ => basically, this means the term-term similarity (correlation) matrix can be computed from the rank-K term matrix U S (term x K size), since X * X’ = (U S) * (U S)’.
2) doc-doc similarity: the same as 1), just transposed:
= X’ * X = V’ S’ U’ * U S V = V’ S^2 V.
3) term-doc similarity:
X_approx = U S V, where X_approx is the rank-K reconstruction of X from the truncated SVD.
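A small numpy sketch of all three computations, under the same conventions (placeholder X and K = 2 again; the setup lines are repeated so the snippet runs on its own). Note these are similarities of the rank-K approximation X_approx, not of the raw X.

```python
import numpy as np

# Same truncated factors as in the sketch above (placeholder X, K = 2).
X = np.random.rand(12, 9)
K = 2
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=False)
U, S, V = U_full[:, :K], np.diag(s_full[:K]), Vt_full[:K, :]

# 1) Term-term similarity: X X' ~ U S^2 U', i.e. inner products between rows of U S.
term_vectors = U @ S                        # N x K, one row per term
term_term = term_vectors @ term_vectors.T   # N x N

# 2) Doc-doc similarity: X' X ~ V' S^2 V, i.e. inner products between rows of (S V)'.
doc_vectors = (S @ V).T                     # d x K, one row per document
doc_doc = doc_vectors @ doc_vectors.T       # d x d

# 3) Term-doc: the rank-K reconstruction of X itself.
X_approx = U @ S @ V                        # N x d
```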
One more powerful, industry-relevant application is the concept of a “pseudo-document”, which can be applied to indexing and enables better querying.
Basically, a query vector (treated just like a new document) is linearly transformed into the LSA space and compared to all pseudo-documents by an inner product.
Dq = Xq’ * U * inv(S), where Xq is the query’s raw term vector
Or, writing the query vector as q:
q_new = q’ * U * inv(S)
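A hedged numpy sketch of this folding-in step; the non-zero query indices below are purely illustrative, and the setup repeats the truncated SVD from above so the snippet is self-contained.

```python
import numpy as np

# Same truncated factors as above (placeholder X, K = 2).
X = np.random.rand(12, 9)
K = 2
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=False)
U, S, V = U_full[:, :K], np.diag(s_full[:K]), Vt_full[:K, :]
doc_vectors = (S @ V).T                     # d x K, one pseudo-document row per document

# Fold a query into the LSA space: Dq = Xq' U inv(S).
# Xq is the query's raw term-count vector over the same N-term vocabulary;
# here it pretends the query contains the 1st and 3rd vocabulary terms.
Xq = np.zeros(X.shape[0])
Xq[[0, 2]] = 1.0

Dq = Xq @ U @ np.linalg.inv(S)              # 1 x K query pseudo-document

# Rank documents by inner product (cosine would also work) in the K-dim space.
scores = doc_vectors @ Dq
ranking = np.argsort(-scores)               # document indices sorted by relevance
```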
References:
http://lsa.colorado.edu/papers/dp1.LSAintro.pdf
Data (the example term-document matrix X from the reference above; rows are terms, columns are the 9 documents):
human 1 0 0 1 0 0 0 0 0
interface 1 0 1 0 0 0 0 0 0
computer 1 1 0 0 0 0 0 0 0
user 0 1 1 0 1 0 0 0 0
system 0 1 1 2 0 0 0 0 0
response 0 1 0 0 1 0 0 0 0
time 0 1 0 0 1 0 0 0 0
EPS 0 0 1 1 0 0 0 0 0
survey 0 1 0 0 0 0 0 0 1
trees 0 0 0 0 0 1 1 1 0
graph 0 0 0 0 0 0 1 1 1
minors 0 0 0 0 0 0 0 1 1
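For completeness, a sketch that builds this exact matrix and looks at the term-term neighbours of “computer” in the reduced space. K = 2 here is an illustrative choice in line with the two-dimensional toy example in the referenced LSA introduction, and numpy is again just one convenient way to run it.

```python
import numpy as np

# The term-document matrix from the data above: 12 terms x 9 documents.
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

K = 2  # low-dimensional approximation, as in the toy example of the reference
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=False)
U, S, V = U_full[:, :K], np.diag(s_full[:K]), Vt_full[:K, :]

# Term-term correlations in the reduced space, e.g. neighbours of "computer".
term_vectors = U @ S
sims = term_vectors @ term_vectors[terms.index("computer")]
for t, s in sorted(zip(terms, sims), key=lambda p: -p[1]):
    print(f"{t:10s} {s: .3f}")
```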