Python text document similarities (w/o libraries) [closed]

Here are some ideas:

  1. Use new_text = text.upper() so that "beer" and "Beer" count as the
    same word (if you need this).
  2. Use words = new_text.split() to make a list of the words
    in your string.
  3. Use unique_words = set(words) to get rid of duplicate words
    if needed. (Avoid calling your variables str, list or set: that shadows the built-ins, and set(...) stops working once set is a variable.)
  4. Start with an empty word_list and copy the first sentence's set into it. For every following sentence, loop over the entries in its set and append each word that is not yet in word_list:

    for word in unique_words:
        if word not in word_list:
            word_list.append(word)

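  5. Now you can make a multi-hot vector from your sentence: for each index i, use (1 if word_list[i] in unique_words else 0). Test against the set of words from step 3, not the raw string, because "in" on a string matches substrings.
  6. Don’t forget to pad your earlier multi-hot vectors with additional zeros whenever you add a word to word_list.
  7. Last step: make a matrix from your vectors, one row per sentence.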
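
Putting the steps together, a minimal sketch could look like the following. The function names tokenize and build_matrix and the two sample sentences are only illustrative assumptions, not part of the question. The sketch builds the complete word_list first and only then creates the vectors, so the zero-padding step is not needed. Punctuation is not handled.

```python
def tokenize(text):
    """Upper-case the text and return its set of distinct words."""
    return set(text.upper().split())


def build_matrix(documents):
    """Return (word_list, matrix) with one multi-hot row per document."""
    word_sets = [tokenize(doc) for doc in documents]

    # Build the shared vocabulary across all documents.
    word_list = []
    for word_set in word_sets:
        # sorted() is used only to make the word order reproducible
        for word in sorted(word_set):
            if word not in word_list:
                word_list.append(word)

    # The vocabulary is complete here, so every row gets full length
    # and no row has to be padded with zeros afterwards.
    matrix = [
        [1 if word in word_set else 0 for word in word_list]
        for word_set in word_sets
    ]
    return word_list, matrix


# Hypothetical sample input
docs = ["Beer is good", "beer and wine"]
word_list, matrix = build_matrix(docs)
```

Each row of matrix then has one column per entry in word_list, so rows can be compared position by position.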
