Co-Occurrence Matrix
This blog post explains, with a tutorial, how to create word embeddings from a co-occurrence matrix and SVD.
Introduction
A computer cannot understand the language of humans. It can do a lot by tracking where and how words occur in a sentence, but it still cannot understand the semantics of a word. To map the human semantic space into something a computer can understand, compute and work with, a word-embedding space is created. This is an n-dimensional vector space in which each word is represented by a vector. Words that are similar in meaning lie close together, and words that are unrelated lie farther apart.
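To make "close together" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way to measure closeness in an embedding space. The three-dimensional vectors are invented purely for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, made up for this example only
toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["table"])

# Similar words should score higher than unrelated ones
print(sim_close > sim_far)
```

Real embeddings have many more dimensions, but the comparison works the same way.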
Problem
The problem here is that we are given a huge .json file which contains a bunch of reviews. Our goal is to create a word-embedding space from it, so that we can represent all the words and their meanings as n-dimensional vectors.
Preparing Data
Data preparation for this task is quite simple: we extract the reviews and tokenize them. We will not be removing stop words or lemmatizing, since all high-frequency words are important.
First, we read the reviews from the given json file. I have used ijson, a library that parses huge json files iteratively (instead of loading the whole file into RAM). I have also used spacy for tokenizing, since it's one of the best at it.
The code below is simple to understand: I have created a class which builds our vocabulary and tokenizes the sentences. There are various ways to do this, but I felt this was a really cool and modular way to achieve it!
import ijson
import spacy


class Data:
    def __init__(self):
        # All the variables in uppercase are globally defined constants
        self.path = PATH
        self.limit = LIMIT
        self.window = WINDOW
        self.nlp = spacy.load(DATASET)
        self.word2index = {}
        self.index2word = {}
        self.vocab = []
        self.index = 0

    def getSentences(self):
        return self.sentences

    def getVocab(self):
        return self.word2index, self.index2word, self.index, self.vocab

    def readJSON(self, limit, path):
        # Reading reviews one by one, without loading the whole file into RAM
        sentences = []
        with open(path) as f:
            # multiple_values=True lets ijson read a file that contains
            # several top-level JSON values one after another
            for item in ijson.items(f, 'reviewText', multiple_values=True):
                sentences.append(item)
                if len(sentences) == limit:
                    break
        return sentences

    def addWords(self, word):
        # Adding words to vocabulary (if they don't exist in it previously)
        if word not in self.word2index:
            self.word2index[word] = self.index
            self.index2word[self.index] = word
            self.vocab.append(word)
            self.index += 1

    def createVocab(self, sentences):
        # Going line by line and token by token
        for sent in sentences:
            words = self.nlp(sent)
            for word in words:
                self.addWords(word.text)

    def handle(self):
        self.sentences = self.readJSON(self.limit, self.path)
        self.createVocab(self.sentences)
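To see what the vocabulary ends up looking like, here is a small standalone sketch of the same indexing idea. It assumes plain whitespace splitting as a stand-in for spaCy's tokenizer, and the function name build_vocab and the two sample sentences are hypothetical, so real output will differ in how punctuation and contractions are split.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer index, in order of first appearance."""
    word2index = {}
    index2word = {}
    for sent in sentences:
        # Assumption: whitespace splitting stands in for spaCy's tokenizer
        for word in sent.split():
            if word not in word2index:
                index = len(word2index)
                word2index[word] = index
                index2word[index] = word
    return word2index, index2word


# Made-up sample sentences, for illustration only
toy_sentences = ["the food was good", "the service was slow"]
word2index, index2word = build_vocab(toy_sentences)

print(word2index)
# {'the': 0, 'food': 1, 'was': 2, 'good': 3, 'service': 4, 'slow': 5}
```

Each word's index later becomes its row and column in the co-occurrence matrix.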
Creating Co-Occurrence Matrix
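A dense numpy matrix stores every entry, so its memory use grows with the square of the vocabulary size. At around 1e5*1e5 float64 values it already needs about 80 GB, which is more than most machines have. Since the data might reach that size, I have used sparse matrices (so that all the cases are covered). Sparse matrices store only the non-zero entries, so they can hold much larger matrices as long as most of the entries are 0s, which is the case for a co-occurrence matrix.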
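A quick back-of-the-envelope calculation shows why a dense matrix becomes impractical at this scale. The sketch below is only a rough estimate: the figure of 50 distinct neighbours per word is a hypothetical value chosen for illustration, and the sparse estimate ignores per-row overhead.

```python
def dense_bytes(vocab_size, bytes_per_value=8):
    """Memory for a dense vocab_size x vocab_size matrix of float64 values."""
    return vocab_size * vocab_size * bytes_per_value


def sparse_bytes_estimate(nonzeros, bytes_per_value=8, bytes_per_index=4):
    """Rough estimate: one value and one column index per non-zero entry."""
    return nonzeros * (bytes_per_value + bytes_per_index)


vocab_size = 100_000

# Dense: every one of the 1e10 entries is stored
dense_gb = dense_bytes(vocab_size) / 1e9

# Sparse: hypothetical average of 50 distinct neighbours per word
sparse_gb = sparse_bytes_estimate(vocab_size * 50) / 1e9

print(dense_gb)   # 80.0
print(sparse_gb)  # 0.06
```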
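I have used lil_matrix to create the sparse matrix, since it supports filling in entries one at a time. You can also use csr_matrix instead, though changing its entries element by element is slow.
I have also used svds, a function that computes a truncated Singular Value Decomposition of a matrix (only the k largest singular values and their vectors) and is optimised to work on sparse matrices. You can also use scikit-learn's TruncatedSVD in place of it.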
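To show what truncating the decomposition means, here is a small sketch on a dense toy matrix. It uses numpy.linalg.svd rather than svds so that it runs on a tiny example, and the 4x4 counts are invented for illustration; the idea of keeping only the top k components is the same.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for 4 words (made-up values)
counts = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 3.0, 0.0],
])

k = 2  # number of dimensions to keep per word

# Full SVD: counts == u @ diag(sigma) @ vt, singular values sorted largest first
u, sigma, vt = np.linalg.svd(counts)

# Truncation keeps only the k largest singular values and their vectors
embeddings = u[:, :k]                                   # one k-dimensional row per word
approximation = u[:, :k] @ np.diag(sigma[:k]) @ vt[:k, :]  # rank-k reconstruction

print(embeddings.shape)  # (4, 2)
```

Each row of embeddings is the k-dimensional vector for the word with that index, and approximation is the closest rank-k matrix to the original counts.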
import numpy as np
from scipy.sparse.linalg import svds
from scipy.sparse import lil_matrix


class Matrix:
    def __init__(self, sentences, nlp, word2index, index2word, index, vocab):
        # Window refers to the window size for calculating co-occurrences
        # K refers to the number of singular values kept by the truncated svd
        self.sentences = sentences
        self.nlp = nlp
        self.window = WINDOW
        self.word2index = word2index
        self.index2word = index2word
        self.index = index
        self.vocab = vocab
        self.k = K

    def getEmbeddings(self):
        return self.embeddings

    def getMatrix(self):
        return self.matrix

    def wordEmbeddings(self, word):
        return self.embeddings[self.word2index[word]]

    def moveWindow(self, matrix, window, sentences):
        # Going word by word and incrementing the counts of all words
        # that follow within the window
        for sent in sentences:
            words = self.nlp(sent)
            # Every position is visited so that pairs near the end of the
            # sentence (and in sentences shorter than the window) are counted
            for i in range(len(words)):
                i_index = self.word2index[words[i].text]
                for j in range(i + 1, min(i + window + 1, len(words))):
                    j_index = self.word2index[words[j].text]
                    matrix[i_index, j_index] += 1
                    matrix[j_index, i_index] += 1
        return matrix

    def getSVD(self, matrix):
        # Performing Singular Value Decomposition on matrix, keeping the top k components
        # svds returns (U, sigma, V^T); each row of U is the embedding of one word
        u, sigma, vt = svds(matrix.tocsr(), k=self.k)
        return u

    def handle(self):
        # np.float was removed from recent NumPy versions, so np.float64 is used
        self.matrix = lil_matrix((self.index, self.index), dtype=np.float64)
        self.matrix = self.moveWindow(self.matrix, self.window, self.sentences)
        self.embeddings = self.getSVD(self.matrix)
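To check which word pairs a sliding window actually counts, here is a small standalone sketch of the counting step. It uses a Counter instead of a sparse matrix, splits on whitespace instead of using spaCy, and visits every position in the sentence; the function name count_cooccurrences and the sample sentence are hypothetical.

```python
from collections import Counter


def count_cooccurrences(tokens, window):
    """Count unordered token pairs that lie within `window` positions of each other."""
    counts = Counter()
    for i in range(len(tokens)):
        # Look ahead at most `window` tokens, without running past the end
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            # Sort the pair so (a, b) and (b, a) share one key
            pair = tuple(sorted((tokens[i], tokens[j])))
            counts[pair] += 1
    return counts


tokens = "the food was good".split()
pairs = count_cooccurrences(tokens, window=2)

for pair, count in sorted(pairs.items()):
    print(pair, count)
```

With a window of 2, "the" and "good" are three positions apart, so that pair is not counted, while the other five pairs in the sentence each get a count of 1.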