Back in the past, researchers in the field of Natural Language Processing was confused about how to represent words so the computer can understand. You also think about that right? How can we make the computer understand that word “that” is used to point something which is far from our position and how can we make computer can understand that word “this” is used to point something that is close to us?

Turns out with Neural Network, there is a magical way to do that. We can use Word2Vec to present those words in form of vectors. So your words will have a bunch of numbers – which is called a vector- that shows the position of your words in the vector space. Well, it will be a long story to talks about all of the mathematical things in here. So I will just show you how you can train your Word2Vec in here.

Vector space in 2 Dimensions

1. All you need to train Word2Vec is just a corpus. Basically, the corpus is just a document with text. You can take it from book, news, or anywhere. So let’s start by getting your corpus from a package called nltk. in nltk, they already provide you a lot of corpora so today we will use Gutenberg corpus to do the training. And from that corpus, we will just use a small document called Shakespeare-Macbeth.

import nltk'gutenberg')

macbetch_sentence = gutenberg.sents('shakespeare-macbeth.txt')

2. There are a lot of things we can do with Word2Vec. One of them is to find similar words which what I will show to you. So we need to import the model of Gensim and similarity package.

import gensim
from gensim import models, similarities

3. Train the model use this code below. It means we input the corpus, use all of the words in the corpus for training, and set the window size as 32 (you can learn the window size in the method explanation.

model = gensim.models.Word2Vec(macbetch_sentence, min_count=1, size=32)

4. The training is finished now. We can try to find the similar words for the word “there” by this code below.


5. The result will be like the picture below. As you probably thinking, this result is really dependent on how good your corpus as. Usually, the bigger the corpus is, the better the performance will be. But it’s debatable

Result of using most_similar model

Cool isn’t it? I think maybe I need to make another detailed post about Word2Vec. Hmm but maybe later haha. So tell me what do you think about this.

Categories: Python

Leave a Reply

Your email address will not be published. Required fields are marked *