Word morphing

Posted on 08 April 2018

Word Morphing¶

In this post I'll describe how I employed word2vec's embeddings and $A^*$ search algorithm to morph between words.

In order to perform word morphing, we'll define a graph $G$ where the set of nodes $N$ represent words, and there's some weight function $f : N \times N \rightarrow \mathbb{R}_{\geq 0}$. Given a start word $n_{\text{start}}$ and an end word $n_{\text{end}}$, our goal is to find a path within the graph which minimizes the sum of weights induced by $f$:

$\text{argmin}_{\langle n_1, ..., n_k \rangle : k \in \mathbb{N}, n_1, ..., n_k \in N, n_1 = n_{\text{start}}, n_k = n_{\text{end}}}\sum_{i=1}^{k-1} f(n_i, n_{i+1})$

Usually when people talk about word morphing they mean searching for a path between $n_{\text{start}}$ and $n_{\text{end}}$ where there's an edge only between words such that one can be achieved from the other by changing one letter. In this case, $f(n_1, n_2)$ is 1 when such a change exists, and $\infty$ otherwise.

In this post I'll show how to morph between semantically similar words, i.e. $f$ will be related to semantics. Here's an example to ilustrate the difference between the two approaches: given $n_{\text{start}} = \text{tooth}$, $n_{\text{end}} = \text{light}$, the approach of changing one character at a time can result in

tooth, booth, boots, botts, bitts, bitos, bigos, bigot, bight, light

where the semantics approach which will be defined in this post will reuslt in

tooth, retina, X_ray, light

Word Semantics¶

In order to capture word semantics we'll use pretrained word2vec embeddings. To those of you who are not familiar with the algorithm, here's an excerpt from wikipedia:

Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

It means every node in the graph can be associated with some vector in high dimensionality space (300 in our case). Therefore, we can naturally define a distance function between every two nodes. We'll use cosine similarity, since this is the metric usually used when one wants to semantically compare between word embeddings. From now on, I'll overload a node symbol $n$ to be its associated word's embedding vector.

To use the word2vec embeddings, we'll download Google's pretrained embeddings, and use gensim package to access them.

In [1]:
import os
import gzip
import shutil
import requests
import gensim
import astar
import numpy as np
In [2]:
def download_file_from_google_drive(file_id, destination):
    drive_url = 'https://docs.google.com/uc?export=download'

    session = requests.Session()
    response = session.get(drive_url, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(drive_url, params=params, stream=True)
    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

def save_response_content(response, dest):
    CHUNK_SIZE = 32768

    with open(dest, 'wb') as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
def download_word2vec_if_needed(dest):
    if os.path.exists(dest):
    word2vec_file_id = '0B7XkCwpI5KDYNlNUTTlSS21pQmM'
    zip_path = 'word2vec.bin.gz'
    download_file_from_google_drive(word2vec_file_id, zip_path)
    with gzip.open(zip_path, 'rb') as f_zip:
        with open(dest, 'wb') as f_dest:
            shutil.copyfileobj(f_zip, f_dest)

word2vec_filename = 'word2vec.bin'

model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_filename, binary=True, limit=100000)

Choosing the Weight Function¶

Given the cosine similarity distance function, we can define our $f$ function to be $f(n_1, n_2) = 1 - \cos(n_1, n_2) = 1 - \frac{n_1 \cdot n_2}{||n_1|| \cdot ||n_2||}$.

However, using this approach we'll face a problem: the best path might include edges with high weight, which will result in successive words that aren't semantically similar.

To tackle this, we can change $f$ to be

$f(n_1, n_2) = \begin{cases} 1 - \cos(n_1, n_2) & \text{if } n_2 \in \text{neighbours}(n_1) \\ \infty & \text{otherwise} \end{cases}$

where $\text{neighbours}(n_1)$ denotes the nearest nodes in the graph to $n_1$ in terms of cosine similarity. The number of neighbours is configurable.

$A^*$ Search¶

Now that we have our graph defined, we'll use a well known search algoithm called $A^*$.

In this algorithm, every node has a cost composed of two terms - $g(n) + h(n)$.

$g(n)$ is the cost of the shortest path from $n_{\text{start}}$ to $n$, and $h(n)$ is a heuristic estimating the cost of the shortest path from $n$ to $n_{\text{end}}$. In our case, the heuristic function will be $f$.

The search algorithm maintains a data structure called the open set. Initially this set contains $n_{\text{start}}$, and at each iteration of the algorithm we pop out the node in the open set with the minimial cost $g(n) + h(n)$, and add its neighbours to the open set. The algorithm stops when the node with the minimal cost is $n_{\text{end}}$.

This algorithm works with whatever heuristic function we choose. But in order to actually find the optimal path, the heurstic function must be admissible, meaning it can't overestimate the true cost. Unfortunately, $f$ is not admissible. However, we'll use the observation that if the vectors have length 1, then the cosine similarity can be obtained using a monotonic transformation over the euclidean distance. It means the two are interchangeable in terms of ranking words by similarity. The euclidean distance is admissible (you can prove it using the triangle inequality), so we can use that instead, by defining

$f(n_1, n_2) = \begin{cases} ||n_1 - n_2|| & \text{if } n_2 \in \text{neighbours}(n_1) \\ \infty & \text{otherwise} \end{cases}$

So to sum up, we'll normalize the word embeddings, use the euclidean distance as a mean to find semantically similar words, and the same euclidean distance to direct $A^*$ search process in order to find the optimal path.

In [3]:
model.init_sims(replace=True)  # normalize the word embeddings to have length 1
In [4]:
def neighbors_fnct(node, n_neighbors, dilute_factor):
    return [neighbor for neighbor, _ in model.similar_by_word(
        node, n_neighbors * dilute_factor)][0:-1:dilute_factor]

def euclidean_dist(n1, n2):
    return np.linalg.norm(model.get_vector(n1) - model.get_vector(n2))

def is_goal_reached_fnct(current, goal):
    return euclidean_dist(current, goal) < 0.1

def morph(start, end, n_neighbors=100, dilute_factor=10):
    res = list(astar.find_path(start,
                               neighbors_fnct=lambda node: neighbors_fnct(node, n_neighbors, dilute_factor),
    if res[-1] != end:
    return res

I chose $neighbours(n)$ to include its 1000 nearest nodes. However, in order to make the search more efficient, we can dilute these using dilute_factor of 10: we pick the nearest neighbour, the 10'th nearest neighbour, the 20'th, and so on - until we have 100 nodes. The intuition behind it is that the best path from some intermediate node to $n_{\text{end}}$ might pass through its nearest neghbour. If it doesn't, it might be the case that it won't pass through the second neighbour neither, since the first and second neighbours might be almost the same. So to save some computations, we just skip some of the nearest neighbours.

And here comes the fun part:

In [5]:
print morph('tooth', 'light')
print morph('John',  'perfect')
print morph('pillow', 'car')
['tooth', u'retina', u'X_ray', u'light']
['John', u'George', u'Frank_Sinatra', u'Wonderful', u'perfect']
['pillow', u'plastic_bag', u'pickup_truck', u'car']

Comments !