from gensim import corpora, models, similarities

documents = ["this book cars, dinosaurs, , fences"]

# remove common words and tokenize
stoplist = set('for of , in - , is'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove commas
texts[0] = [text.replace(',', '') for text in texts[0]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "i cars , birds"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space

index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(sims)
In the code above I am comparing how similar "this book cars, dinosaurs, , fences" is to "i cars , birds" using the cosine similarity technique.
The two sentences have only one word in common, "cars", yet when I run the code I get that they are 100% similar. This does not make sense to me.
Can anyone suggest how to improve my code so that I get a reasonable number?
These topic-modelling techniques need varied, realistic data to achieve sensible results. Toy-sized examples of just one or a few texts don't work well, and even if they do, it's often just luck or contrived suitability.
In particular:
- A model with only one example can't sensibly create multiple topics, as there's no contrast between documents to model.
- A model presented with words it hasn't seen before ignores those words, so your test doc appears to it the same as the single word 'cars', the only word it has seen before.
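The second point can be checked without gensim at all. Below is a minimal sketch, assuming a hand-written vocabulary and a hypothetical helper `known_bow` that imitates the default behaviour of `dictionary.doc2bow` (tokens missing from the dictionary are silently dropped). It is a stand-in for illustration, not gensim's implementation.

```python
from collections import Counter

# Vocabulary learned from the single training document (token -> id).
# Assumption: ids are assigned by hand here; gensim assigns its own.
vocab = {"this": 0, "book": 1, "cars": 2, "dinosaurs": 3, "fences": 4}


def known_bow(text, vocab):
    """Hypothetical stand-in for doc2bow: count only tokens found in vocab."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())


test_bow = known_bow("i cars , birds", vocab)
cars_bow = known_bow("cars", vocab)

print(test_bow)              # [(2, 1)]
print(test_bow == cars_bow)  # True
```

As far as this vocabulary is concerned, the four-token test doc and the bare word 'cars' are the same input.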
In this case, both your single training document and the test document get modeled by LSI as having a 0 contribution from the 0th topic and a positive contribution (of different magnitudes) from the 1st topic. Since cosine similarity merely compares angle, and not magnitude, both docs lie along the same line from the origin, so there is no angle between them, and thus the similarity is 1.0.
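The geometry is easy to verify by hand. Here is a minimal sketch, assuming made-up topic weights (the numbers are illustrative, not output from the model in the question) and a hypothetical `cosine` helper written in plain Python:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative 2-topic vectors: zero weight on topic 0, positive on topic 1.
train_doc = (0.0, 2.0)
test_doc = (0.0, 0.5)

# Same direction, different length: the angle between them is zero.
same_line = cosine(train_doc, test_doc)

# A vector with weight on both topics points elsewhere, so the score drops.
other_doc = (1.0, 1.0)
different = cosine(train_doc, other_doc)

print(same_line)   # 1.0
print(different)
```

Any two vectors that differ only in length score 1.0, which is why a single shared word is enough to produce "100% similar" here.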
But if you had better training data, and a test doc with more than a single known word, you might start to get more sensible results. Even a few dozen training docs, and a test doc with several known words, might help, but hundreds or thousands or tens of thousands of training docs would be even better.