Ali Roozbehi


I hold a Bachelor's degree in Biomedical Engineering from Amirkabir University of Technology. I am interested in programming, neuroscience, and data analysis, and on this website, I share interesting things that I learn.


Latent Semantic Analysis

14 Dec 2021 2:57 PM

Definition

Latent Semantic Analysis, or LSA, is one of the topic-modeling techniques used to extract topics from text. These topics are simply clusters of words, each representing a concept that has no explicit name.

This technique uses the Singular Value Decomposition (SVD) of the document-term matrix.

Necessity

For example, consider the following two sentences:

 

s1: the petrol in this car is low

s2: the vehicle is short on fuel

These two sentences have similar meanings, but how can a machine be made to recognize this?

One approach that initially seems reasonable is to vectorize the two sentences and use the inner product of the two vectors to measure their similarity. In this case, each sentence becomes a binary vector over the shared vocabulary (the, petrol, in, this, car, is, low, vehicle, short, on, fuel).

The resulting matrix is called the document-term matrix: each row is a document (here, one of the sentences), and its columns are all the words used across all the sentences (documents).

Calculating the normalized inner product (cosine similarity) of the two vectors above yields a value of about 0.3, which, given that these two sentences are nearly synonymous, is misleadingly low.
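To make the number concrete, here is a minimal sketch (my own addition, not from the original post) that builds the binary document-term matrix with scikit-learn and computes the cosine similarity of the two sentence vectors:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["the petrol in this car is low",
             "the vehicle is short on fuel"]

# binary=True records word presence/absence instead of counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)

# the sentences share only "the" and "is", so the similarity stays low
print(cosine_similarity(X[0], X[1]))  # about 0.31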

Also, building this matrix is computationally expensive, and each row contains mostly zeros, so the matrix is large and sparse.

As can be seen, plain vectorization of the sentences does not give the desired output here. Therefore, the LSA technique, which relies on Singular Value Decomposition, is used instead.

 

Singular Value Decomposition

Singular Value Decomposition (SVD) is a method for writing any matrix A of size m x n as A = UΣVᵀ, where U and V are orthogonal matrices and Σ is a diagonal matrix containing the singular values.

Note: In the SVD method, the values σ1 to σk are the singular values of the m x n matrix A; in other words, they are the square roots of the eigenvalues of the square matrix AAᵀ (equivalently, of AᵀA).

Note: The rank of a matrix is equal to the number of its nonzero singular values.

One of the applications of SVD is low-rank approximation of matrices. In this method, instead of keeping all the singular values of matrix A in Σ, only the largest ones are kept, and the truncated product is used as an approximation of matrix A.
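As an illustration, here is a small numpy sketch (my own example, not from the post) of a rank-1 approximation:

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# full_matrices=False returns the compact form: U (m x r), s (r,), Vt (r x n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1  # keep only the largest singular value
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)    # singular values, sorted from largest to smallest
print(A_k)  # best rank-1 approximation of A in the least-squares sense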

Using SVD in LSA

Applying SVD to the document-term matrix A gives A = UΣVᵀ.

In the LSA implementation, each column of the resulting matrix U corresponds to a topic and each row corresponds to a phrase or document; in other words, U maps documents into the topic space, while Vᵀ maps topics to terms.

Another advantage of this method is dimensionality reduction. Instead of using every word as a dimension, we keep only the directions with the largest singular values, i.e. the strongest semantic weight, which reduces both noise and computational cost.

Example Implementation

Consider the following sentences:

s1: He is a good dog.

s2: The dog is too lazy.

s3: That is a brown cat.

s4: The cat is very active.

s5: I have brown cat and dog.

To implement LSA, the following steps are required:

  • Preprocessing:

In this stage the sentences are cleaned, for example by lowercasing them and removing punctuation.
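The cleaned data in the original post is shown as an image; a minimal sketch of a typical cleaning step could look like this (the column names documents and clean_documents are assumptions chosen to match the code that follows):

import re
import pandas as pd

documents = ["He is a good dog.",
             "The dog is too lazy.",
             "That is a brown cat.",
             "The cat is very active.",
             "I have brown cat and dog."]
df = pd.DataFrame({'documents': documents})

# lowercase and keep letters only; stop words are removed later
# by TfidfVectorizer(stop_words='english')
df['clean_documents'] = df['documents'].str.lower().apply(
    lambda s: re.sub(r'[^a-z\s]', ' ', s).strip())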

  • Constructing the document-term matrix and the vocabulary dictionary:
from sklearn.feature_extraction.text import TfidfVectorizer

# build a TF-IDF weighted document-term matrix; English stop words are dropped
vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
X = vectorizer.fit_transform(df['clean_documents'])
# get_feature_names() was removed in recent scikit-learn; use the new name
dictionary = vectorizer.get_feature_names_out()

In the constructed matrix, each row represents a sentence (document) and each column represents a word from the dictionary; after stop-word removal, the dictionary here contains the terms active, brown, cat, dog, good, and lazy.
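For a quick look at the matrix (an inspection step I am adding for illustration):

import pandas as pd

# one row per document, one column per dictionary term
print(pd.DataFrame(X.toarray(), columns=dictionary).round(2))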

  • SVD decomposition:

To perform the SVD decomposition, we can use the TruncatedSVD class from the sklearn library.

from sklearn.decomposition import TruncatedSVD

# SVD represents documents and terms as vectors in the topic space
svd_model = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=100, random_state=122)
lsa = svd_model.fit_transform(X)

In this function, the n_components parameter is the number of topics we want to extract; it reduces each document vector from the size of the dictionary down to that many dimensions.
Here, our sentences are mostly about 2 topics, dogs and cats, so we set the number to 2 to classify the sentences into those 2 topics.
We print the output using the following code:

import pandas as pd

pd.options.display.float_format = '{:,.2f}'.format
topic_encoded_df = pd.DataFrame(lsa, columns=["topic_1", "topic_2"])
topic_encoded_df["documents"] = df['documents']
topic_encoded_df[["documents", "topic_1", "topic_2"]]

Finally, the output is a table with one row per sentence and the weight of each of the two topics for that sentence.

The results obtained are very interesting! Examining the values, the dog-related sentences load mainly on one topic and the cat-related sentences mainly on the other, even though a simple dot product between the sentence vectors could never have led us to this grouping. The last sentence, which mentions both a cat and a dog, receives noticeable weight on both topics.
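To see which words define each topic, the term-topic weights stored in svd_model.components_ can also be inspected (an extra step I am adding for illustration):

import pandas as pd

# rows of components_ are topics, columns are dictionary terms
term_topic_df = pd.DataFrame(svd_model.components_.T,
                             index=dictionary, columns=["topic_1", "topic_2"])
print(term_topic_df.round(2))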

 

You can view the mentioned code at the following link:

https://github.com/ali-rzb/LSA



Tags : Programming , NLP , Natural_Language_Processing , Latent_Semantic_Analysis



