Word Morphing

I assume you're familiar with image morphing - the process of changing one image into another through a seamless transition. So what would word morphing look like? If we want to transition seamlessly from one word into another, which words would the path pass through?

In this post I'll describe how I employed word2vec embeddings and the $A^*$ search algorithm to morph between words.

In order to perform word morphing, we'll define a graph $G$ where the set of nodes $N$ represents words, together with a weight function $f : N \times N \rightarrow \mathbb{R}_{\geq 0}$. Given a start word $n_{\text{start}}$ and an end word $n_{\text{end}}$, our goal is to find a path within the graph which minimizes the sum of weights induced by $f$:

$\text{argmin}_{\langle n_1, ..., n_k \rangle : k \in \mathbb{N}, n_1, ..., n_k \in N, n_1 = n_{\text{start}}, n_k = n_{\text{end}}}\sum_{i=1}^{k-1} f(n_i, n_{i+1})$

Usually when people talk about word morphing they mean searching for a path between $n_{\text{start}}$ and $n_{\text{end}}$ where there's an edge only between words such that one can be obtained from the other by changing a single letter, as can be seen here. In this case, $f(n_1, n_2)$ is 1 when such a change exists, and $\infty$ otherwise.
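
As a point of reference, here's a minimal sketch of that one-letter-change weight function (the function name is mine, written just for illustration; it's not part of the code used later in this post):

def one_letter_change_weight(word_a, word_b):
    # Weight is 1 if the words differ in exactly one letter, infinity otherwise.
    if len(word_a) != len(word_b):
        return float('inf')
    n_diffs = sum(a != b for a, b in zip(word_a, word_b))
    return 1 if n_diffs == 1 else float('inf')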

In this post I'll show you how to morph between semantically similar words, i.e. $f$ will be related to semantics. Here's an example to illustrate the difference between the two approaches: given $n_{\text{start}} = \text{tooth}$, $n_{\text{end}} = \text{light}$, the approach of changing one character at a time can result in

tooth, booth, boots, botts, bitts, bitos, bigos, bigot, bight, light

while the semantics approach which will be defined in this post results in

tooth, retina, X_ray, light

Word Semantics

In order to capture word semantics we'll use pretrained word2vec embeddings. For those of you who aren't familiar with the algorithm, here's an excerpt from Wikipedia:

Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

This means every node in the graph can be associated with a vector in a high-dimensional space (300 dimensions in our case). Therefore, we can naturally define a distance function between every two nodes. We'll use cosine similarity, since this is the metric usually used to semantically compare word embeddings. From now on, I'll overload a node symbol $n$ to also denote its associated word's embedding vector.
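
To make the distance concrete, here's a minimal numpy sketch of cosine similarity between two embedding vectors (the function is mine and is only for illustration; the actual model is loaded below):

import numpy as np

def cosine_similarity(n1, n2):
    # Cosine of the angle between the two embedding vectors.
    return np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))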

To use the word2vec embeddings, we'll download Google's pretrained embeddings and use the gensim package to access them.

In [1]:
import os
import gzip
import shutil
import requests
import gensim
import astar
import numpy as np
In [2]:
def download_file_from_google_drive(file_id, destination):
    # Google Drive requires a confirmation token for large files, so we may
    # need to issue the request twice.
    drive_url = 'https://docs.google.com/uc?export=download'

    session = requests.Session()
    response = session.get(drive_url, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)

    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(drive_url, params=params, stream=True)

    save_response_content(response, destination)


def get_confirm_token(response):
    # The confirmation token is stored in a 'download_warning' cookie.
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value


def save_response_content(response, dest):
    # Stream the response to disk in chunks to avoid holding it all in memory.
    CHUNK_SIZE = 32768

    with open(dest, 'wb') as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)


def download_word2vec_if_needed(dest):
    # Download the gzipped word2vec binary and extract it, unless it's already there.
    if os.path.exists(dest):
        return
    word2vec_file_id = '0B7XkCwpI5KDYNlNUTTlSS21pQmM'
    zip_path = 'word2vec.bin.gz'
    download_file_from_google_drive(word2vec_file_id, zip_path)
    with gzip.open(zip_path, 'rb') as f_zip:
        with open(dest, 'wb') as f_dest:
            shutil.copyfileobj(f_zip, f_dest)
    os.remove(zip_path)
        

word2vec_filename = 'word2vec.bin'
download_word2vec_if_needed(word2vec_filename)

model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_filename, binary=True, limit=100000)

Choosing the Weight Function

Given the cosine similarity distance function, we can define our $f$ function to be $f(n_1, n_2) = 1 - \cos(n_1, n_2) = 1 - \frac{n_1 \cdot n_2}{||n_1|| \cdot ||n_2||}$.

However, using this approach we'll face a problem: the best path might include edges with high weight, which will result in successive words that aren't semantically similar.

To tackle this, we can change $f$ to be

$f(n_1, n_2) = \begin{cases} 1 - \cos(n_1, n_2) & \text{if } n_2 \in \text{neighbors}(n_1) \\ \infty & \text{otherwise} \end{cases}$

where $\text{neighbors}(n_1)$ denotes the nearest nodes in the graph to $n_1$ in terms of cosine similarity. The number of neighbors is configurable.
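
As a sketch of what this looks like in code (using the model loaded above; the function itself is mine and isn't used by the search code later in the post):

def weight(n1, n2, n_neighbors=100):
    # Infinite weight unless n2 is among the nearest neighbors of n1.
    neighbors = {word for word, _ in model.similar_by_word(n1, topn=n_neighbors)}
    if n2 not in neighbors:
        return float('inf')
    # Otherwise, the weight is one minus the cosine similarity.
    return 1.0 - model.similarity(n1, n2)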

$A^*$ Search

Now that we have our graph defined, we'll use a well-known search algorithm called $A^*$.

In this algorithm, every node has a cost composed of two terms - $g(n) + h(n)$.

$g(n)$ is the cost of the best path found so far from $n_{\text{start}}$ to $n$, and $h(n)$ is a heuristic estimating the cost of the shortest path from $n$ to $n_{\text{end}}$. In our case, the heuristic will be $h(n) = f(n, n_{\text{end}})$.

The search algorithm maintains a data structure called the open set. Initially this set contains $n_{\text{start}}$, and at each iteration of the algorithm we pop out the node in the open set with the minimal cost $g(n) + h(n)$, and add its neighbors to the open set. The algorithm stops when the node with the minimal cost is $n_{\text{end}}$.
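
To make the open-set bookkeeping concrete, here's a minimal sketch of the generic $A^*$ loop (purely for illustration; the actual search below uses the astar package instead):

import heapq

def a_star(start, goal, neighbors, weight, heuristic):
    # The open set is a priority queue of (g + h, node); g_cost holds the best g found so far.
    open_set = [(heuristic(start, goal), start)]
    g_cost = {start: 0.0}
    came_from = {}
    while open_set:
        _, node = heapq.heappop(open_set)
        if node == goal:
            # Reconstruct the path by walking back through came_from.
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        for neighbor in neighbors(node):
            tentative_g = g_cost[node] + weight(node, neighbor)
            if tentative_g < g_cost.get(neighbor, float('inf')):
                g_cost[neighbor] = tentative_g
                came_from[neighbor] = node
                heapq.heappush(open_set, (tentative_g + heuristic(neighbor, goal), neighbor))
    return None  # no path found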

This algorithm works with whatever heuristic function we choose. But in order to actually find the optimal path, the heuristic function must be admissible, meaning it can't overestimate the true cost. Unfortunately, $f$ is not admissible. However, we'll use the observation that if the vectors have length 1, then the cosine similarity can be obtained by a monotonic transformation of the euclidean distance. This means the two are interchangeable in terms of ranking words by similarity. The euclidean distance is admissible (you can prove it using the triangle inequality), so we can use it instead, by defining

$f(n_1, n_2) = \begin{cases} ||n_1 - n_2|| & \text{if } n_2 \in \text{neighbors}(n_1) \\ \infty & \text{otherwise} \end{cases}$
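
To see why the two rankings agree for unit-length vectors, note that expanding the squared norm gives

$||n_1 - n_2||^2 = ||n_1||^2 + ||n_2||^2 - 2 \, n_1 \cdot n_2 = 2 - 2\cos(n_1, n_2)$

so the euclidean distance is a decreasing function of the cosine similarity.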

So to sum up, we'll normalize the word embeddings, use the euclidean distance as a means of finding semantically similar words, and use the same euclidean distance to guide the $A^*$ search towards the optimal path.

In [3]:
model.init_sims(replace=True)  # normalize the word embeddings to have length 1
In [4]:
def neighbors_fnct(node, n_neighbors, dilute_factor):
    # Take the n_neighbors * dilute_factor nearest words, then keep every
    # dilute_factor'th one of them.
    return [neighbor for neighbor, _ in model.similar_by_word(
        node, n_neighbors * dilute_factor)][0:-1:dilute_factor]


def euclidean_dist(n1, n2):
    # Since the embeddings were normalized, this distance ranks words the same
    # way cosine similarity does.
    return np.linalg.norm(model.get_vector(n1) - model.get_vector(n2))


def is_goal_reached_fnct(current, goal):
    return euclidean_dist(current, goal) < 0.1


def morph(start, end, n_neighbors=100, dilute_factor=10):
    res = list(astar.find_path(start,
                               end,
                               neighbors_fnct=lambda node: neighbors_fnct(node, n_neighbors, dilute_factor),
                               heuristic_cost_estimate_fnct=euclidean_dist,
                               distance_between_fnct=euclidean_dist,
                               is_goal_reached_fnct=is_goal_reached_fnct))
    if res[-1] != end:
        res.append(end)
    return res

I chose $\text{neighbors}(n)$ to include its 1000 nearest nodes. However, to make the search more efficient, we dilute these using a dilute_factor of 10: we pick the nearest neighbor, the 10th nearest, the 20th, and so on, until we have 100 nodes. The intuition is that the best path from some intermediate node to $n_{\text{end}}$ might pass through its nearest neighbor. If it doesn't, it might not pass through the second nearest neighbor either, since the first and second neighbors are often almost the same. So to save some computation, we simply skip some of the nearest neighbors.

And here comes the fun part:

In [5]:
print(morph('tooth', 'light'))
print(morph('John',  'perfect'))
print(morph('pillow', 'car'))
['tooth', u'retina', u'X_ray', u'light']
['John', u'George', u'Frank_Sinatra', u'Wonderful', u'perfect']
['pillow', u'plastic_bag', u'pickup_truck', u'car']

Final thoughts

Implementing the word morphing project was fun, but not as fun as playing around and trying this tool on whatever pair of words I could think of. I encourage you to go ahead and play with it yourself. Let me know in the comments what interesting and surprising morphings you have found :)
