python - Text similarity with gensim and cosine similarity


    from gensim import corpora, models, similarities

    documents = ["This is a book about cars, dinosaurs, and fences"]

    # Remove common words and tokenize
    stoplist = set('for a of the and to in - , is'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist]
             for document in documents]

    # Remove commas
    texts[0] = [text.replace(',', '') for text in texts[0]]

    # Map each token to an integer id and build the bag-of-words corpus
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

    doc = "I like cars and birds"
    vec_bow = dictionary.doc2bow(doc.lower().split())

    vec_lsi = lsi[vec_bow]  # convert the query to LSI space

    index = similarities.MatrixSimilarity(lsi[corpus])

    sims = index[vec_lsi]  # perform a similarity query against the corpus
    print(sims)

In the above code I am comparing how similar "This is a book about cars, dinosaurs, and fences" is to "I like cars and birds" using the cosine similarity technique.

The two sentences have only one word in common, "cars", yet when I run the code I get that they are 100% similar. This does not make sense to me.

Can someone suggest how to improve my code so that I get a reasonable number?

These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few texts don't work well, and even if they do, it's often just luck or contrived suitability.

In particular:

  • A model trained on only one example can't sensibly create multiple topics, as there's no contrast between documents to model.

  • A model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars', the only word it has seen before.
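The second point can be seen without gensim at all. The following is a minimal sketch in plain Python: `build_vocab` and `to_bow` are hypothetical helpers that only imitate the behaviour described above (words outside the training vocabulary are dropped), and the token lists are simplified stand-ins for the two sentences, not the exact output of the question's preprocessing.

```python
from collections import Counter


def build_vocab(texts):
    """Map each distinct training token to an integer id (hypothetical helper)."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Count only tokens present in the vocabulary; unseen tokens are dropped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Simplified stand-ins for the training sentence and the test sentence.
train = [["this", "book", "cars", "dinosaurs", "fences"]]
vocab = build_vocab(train)

test_bow = to_bow(["i", "cars", "birds"], vocab)
cars_bow = to_bow(["cars"], vocab)

print(test_bow)
print(test_bow == cars_bow)
```

Under these assumptions the test sentence and the bare word 'cars' produce the same bag-of-words vector, so any model built on top of that vocabulary cannot tell them apart.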

In this case, both your single training document and the test document get modeled by LSI as having a 0 contribution from the 0th topic and a positive contribution (of different magnitudes) from the 1st topic. Since cosine similarity merely compares angle, and not magnitude, both docs lie along the same line from the origin, so there is no angle of difference between them, and thus the similarity is 1.0.
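A small worked example may make the angle-versus-magnitude point concrete. This is only a sketch: the two-dimensional vectors below are made-up values chosen to match the shape described above (zero in the 0th topic, positive in the 1st), not numbers actually produced by the LSI model, and `cosine` is a hand-written helper rather than a gensim function.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up topic vectors: zero in topic 0, different positive weights in topic 1.
train_vec = [0.0, 3.0]
test_vec = [0.0, 0.5]

# A vector that also has weight in topic 0, for contrast.
other_vec = [1.0, 1.0]

same_direction = cosine(train_vec, test_vec)
different_direction = cosine(train_vec, other_vec)

print(same_direction)
print(different_direction)
```

The first pair points in the same direction, so the similarity is 1.0 even though the magnitudes differ by a factor of six. The second pair gives roughly 0.71, because only there is an actual angle between the vectors.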

But if you had better training data, and a test doc with more than a single known word, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help... but hundreds or thousands or tens of thousands of training docs would be even better.

