Calculate cosine similarity given 2 sentence strings

A simple pure-Python implementation would be:

import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)

Prints:

Cosine: 0.861640436855

The cosine formula used here is described here.

This does not include weighting of the words by tf-idf, but in order to use tf-idf, you need to have a reasonably large corpus from which to estimate tfidf weights.

You can also develop it further, by using a more sophisticated way to extract words from a piece of text, stem or lemmatise it, etc.

Leave a Comment