Ali Roozbehi


I hold a Bachelor's degree in Biomedical Engineering from Amirkabir University of Technology. I am interested in programming, neuroscience, and data analysis, and on this website, I share interesting things that I learn.

Word Embedding

14 Dec 2021 3:32 PM

Definition

Word embedding, in brief, is the mapping of words and phrases to numerical vectors. In other words, it is a form of dimensionality reduction: in a naive one-hot representation, each word occupies as many dimensions as there are words in the vocabulary, and word embedding techniques reduce this very high-dimensional space to a limited number of dimensions [3].
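A minimal sketch of the dimensionality involved (the tiny vocabulary and the 2-d coordinates below are invented for illustration), contrasting one-hot vectors, whose length equals the vocabulary size, with short dense embedding vectors:

```python
# Toy vocabulary; in practice this would contain tens of thousands of words.
vocab = ["king", "queen", "man", "woman", "apple"]

def one_hot(word):
    """One-hot vector: one dimension per word in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

# A hypothetical, hand-made 2-dimensional embedding of the same words.
embedding = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

print(len(one_hot("king")))    # as many dimensions as vocabulary words
print(len(embedding["king"]))  # only 2 dimensions after embedding
```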

As can be seen, a large number of words can be described, and given coordinates, using a limited number of adjectives or base words. For example, if we pick 2 of these base words as axes, each word becomes a point with 2 coordinates:
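A small sketch of this idea, with invented coordinates over two base adjectives ("big" and "edible"):

```python
import math

# Hypothetical coordinates over two base adjectives: (big, edible).
coords = {
    "elephant":   (0.9, 0.1),
    "mouse":      (0.1, 0.1),
    "watermelon": (0.7, 0.9),
    "berry":      (0.1, 0.9),
}

# Distances in this 2-d space already reflect some similarity:
# "watermelon" lies closer to "berry" (both edible) than to "mouse".
print(math.dist(coords["watermelon"], coords["berry"]))
print(math.dist(coords["watermelon"], coords["mouse"]))
```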

Method

To embed words, it is necessary to first extract the most important words, a step called preprocessing in NLP. Typical operations include:
• Removing special characters (?, %, #, etc.)
• Removing words shorter than 3 letters
• Converting uppercase letters to lowercase
• Removing stop words (is, a, and, etc.)
The code below implements these steps, followed by lemmatisation:

# Import libraries for text preprocessing
import re
import nltk

# These resources only need to be downloaded once. After the first run --
# or if you know you already have them installed -- you can comment out
# these two lines (with a #).
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Standard English stop words, extended with a custom list (one word per line)
stop_words = set(stopwords.words("english"))
with open('custom-stopwords.txt') as f:
    csw = set(line.strip().lower() for line in f)
stop_words = stop_words.union(csw)

# `dataset` is assumed to be a pandas DataFrame and `datacol` the name of
# its text column
dataset['word_count'] = dataset[datacol].apply(lambda x: len(str(x).split(" ")))
ds_count = len(dataset.word_count)

lem = WordNetLemmatizer()

# Pre-process dataset to get a cleaned and normalised text corpus
corpus = []
for i in range(ds_count):
    # Keep letters only (removes punctuation)
    text = re.sub('[^a-zA-Z]', ' ', str(dataset[datacol][i]))
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove tags
    text = re.sub("</?.*?>", " <> ", text)
    
    # Remove remaining special characters and digits
    text = re.sub(r"(\d|\W)+", " ", text)
    
    # Tokenise, drop stop words and words shorter than 3 letters, lemmatise
    words = [lem.lemmatize(word) for word in text.split()
             if word not in stop_words and len(word) > 2]
    corpus.append(" ".join(words))

# View sample pre-processed corpus item
i = 0
print("Sentence :\t\t",        str(dataset[datacol][i]))
print("Extracted Corpuses :\t", corpus[i])

The above code extracts the main words of each sentence.
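For example, the same cleaning steps can be sketched self-contained on a single sentence (using a small hand-made stop-word list instead of NLTK's, so it runs without downloads):

```python
import re

# A tiny hand-made stop-word list for this demo; NLTK's is much larger.
stop_words = {"is", "a", "an", "and", "the", "of", "to", "in"}

def preprocess(sentence):
    """Keep letters only, lowercase, drop stop words and short words."""
    text = re.sub('[^a-zA-Z]', ' ', sentence).lower()
    return " ".join(w for w in text.split()
                    if w not in stop_words and len(w) > 2)

print(preprocess("Word Embedding is a mapping of words to 23 vectors!"))
# → word embedding mapping words vectors
```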

Next, we rank single words (corpus terms) statistically: we count the number of repetitions of each word in our database and obtain the following graph:
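A minimal sketch of this counting step, using Python's standard library on a toy pre-processed corpus:

```python
from collections import Counter

# Toy pre-processed corpus (each item is a cleaned document).
corpus = ["word embedding maps words vectors",
          "word vectors capture meaning",
          "embedding vectors reduce dimensions"]

# Count how often each word occurs across the whole corpus.
freq = Counter(word for doc in corpus for word in doc.split())

# The most frequent terms become the dimensions of the reduced space.
print(freq.most_common(3))
```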

We also examine two-word and three-word phrases (bigrams and trigrams):
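Such phrases can be extracted and counted the same way; a small standard-library sketch:

```python
from collections import Counter

def ngrams(words, n):
    """Return all contiguous n-word phrases from a list of words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "word embedding maps words to vectors".split()
print(ngrams(words, 2))   # bigrams
print(ngrams(words, 3))   # trigrams

# Phrase frequencies are counted just like single-word frequencies.
bigram_freq = Counter(ngrams(words, 2))
```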

A desired number of the most frequent items is taken as the dimensions of the secondary space; these are called corpus terms. Finally, the isolated words are fed to a neural network, and the coordinates of each word are calculated with respect to the basis vectors obtained in the previous stages. For example, the following image shows this process carried out with one of the Word2Vec neural network architectures, using 2 basis vectors:
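Word2Vec itself requires training a network, but the idea of deriving low-dimensional coordinates from word usage can be sketched with a simpler classical technique: build a word-by-word co-occurrence matrix and keep its top 2 singular directions (an LSA-style reduction, not Word2Vec itself; the tiny corpus is invented):

```python
import numpy as np

corpus = ["king rules kingdom", "queen rules kingdom",
          "man walks street", "woman walks street"]

# Vocabulary in first-seen order.
vocab = []
for doc in corpus:
    for w in doc.split():
        if w not in vocab:
            vocab.append(w)
idx = {w: j for j, w in enumerate(vocab)}

# Co-occurrence matrix: count pairs of words appearing in the same document.
co = np.zeros((len(vocab), len(vocab)))
for doc in corpus:
    ws = doc.split()
    for a in ws:
        for b in ws:
            if a != b:
                co[idx[a], idx[b]] += 1

# Keep the 2 strongest singular directions as 2-d word coordinates.
U, S, Vt = np.linalg.svd(co)
coords = U[:, :2] * S[:2]

# Words used in similar contexts ("king"/"queen") get similar coordinates.
for w in vocab:
    print(w, coords[idx[w]].round(2))
```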

As can be seen, words that have closer semantic meanings also have closer coordinates.

Vector Combination

Using this method, we can obtain offset vectors that transform one state into another. For example, in the image below, we see four word vectors in a space with 2 basis vectors:

Now, by subtracting the "man" vector from the "king" vector and the "woman" vector from the "queen" vector, we have:

In the same way, we can derive vectors for new words from existing ones. In the following example, the same method transforms country-name vectors into capital-name vectors, masculine into feminine, and present-tense verbs into future-tense verbs [4]:
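This vector arithmetic can be sketched with hand-made 2-d vectors (the coordinates are invented for illustration; real embeddings would come from a trained model):

```python
import math

# Hypothetical 2-d embeddings; axis 1 ~ royalty, axis 2 ~ masculinity.
vec = {
    "king":  (0.9, 0.9),
    "queen": (0.9, 0.1),
    "man":   (0.1, 0.9),
    "woman": (0.1, 0.1),
}

def add(a, b): return (a[0] + b[0], a[1] + b[1])
def sub(a, b): return (a[0] - b[0], a[1] - b[1])

# king - man + woman should land near queen.
target = add(sub(vec["king"], vec["man"]), vec["woman"])

# The nearest stored vector to the result is "queen".
nearest = min(vec, key=lambda w: math.dist(vec[w], target))
print(nearest)   # → queen
```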

You can see the mentioned code at the link below:

https://github.com/ali-rzb/NLP_Text_Analysis



Tags : Programming , NLP , Word_Embedding , Natural_Language_Processing

