Text similarity between lists using CountVectorizer and TfidfVectorizer

Time: 2019-05-28 16:40:18

Tags: python scikit-learn gensim countvectorizer tfidfvectorizer

I want to see the similarity between lists using CountVectorizer and TfidfVectorizer.

I have lists like the following:

list1 = [['i', 'love', 'machine', 'learning', 'its', 'awesome'],
         ['i', 'love', 'coding', 'in', 'python'],
         ['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']

Here, I want to see the similarity between list1[0] and list2, list1[1] and list2, and list1[2] and list2.

The expected output should be a similarity score for each of these pairs.

1 Answer:

Answer 0 (score: 2)

From the docs, TfidfVectorizer is: "Equivalent to CountVectorizer followed by TfidfTransformer."
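The equivalence quoted from the docs can be checked directly. The following is a minimal sketch, assuming a reasonably recent scikit-learn and default settings for all three classes. The names `one_step`, `two_step`, and `same` are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "i love machine learning its awesome",
    "i love coding in python",
    "i love building chatbots",
    "i love chatbots",
]

# One step: tokenize, count, and apply TF-IDF weighting.
one_step = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw token counts first, then TF-IDF weighting on top.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Both routes should yield the same matrix (up to floating-point error).
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

Note that with the default tokenizer, single-character tokens such as "i" are dropped, so the matrix has 10 columns rather than 11.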

Here is the code:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "i love machine learning its awesome",
    "i love coding in python",
    "i love building chatbots",
    "i love chatbots"
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# print(vectorizer.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
arr = X.toarray()

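And the answers, using cosine similarity. Because TfidfVectorizer L2-normalizes each row by default, the dot product of two rows is their cosine similarity: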
import numpy as np

# similarity of `list1[0]` and `list2`
np.dot(arr[0], arr[3])  # gives ~0.139
# similarity of `list1[1]` and `list2`
np.dot(arr[1], arr[3])  # gives ~0.159
# similarity of `list1[2]` and `list2`
np.dot(arr[2], arr[3])  # gives ~0.687
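To see where these numbers come from, here is a rough pure-Python sketch that starts from the token lists in the question and mirrors what I understand to be the scikit-learn defaults: tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The helper names `tfidf_unit_vector` and `dot` are made up for illustration and are not part of any library.

```python
import math
import re
from collections import Counter

list1 = [['i', 'love', 'machine', 'learning', 'its', 'awesome'],
         ['i', 'love', 'coding', 'in', 'python'],
         ['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']

# Join each token list into one string, matching the corpus used above.
docs = [" ".join(tokens) for tokens in list1 + [list2]]

# Assumed default tokenization: lowercase, keep tokens of 2+ word characters
# (so the single-character token "i" is dropped).
TOKEN = re.compile(r"\b\w\w+\b")
tokenized = [TOKEN.findall(doc.lower()) for doc in docs]

n_docs = len(tokenized)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
       for term, df in doc_freq.items()}


def tfidf_unit_vector(tokens):
    """Return a sparse TF-IDF vector (dict) scaled to unit L2 length."""
    weights = {term: count * idf[term]
               for term, count in Counter(tokens).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v[term] for term, w in u.items() if term in v)


vectors = [tfidf_unit_vector(tokens) for tokens in tokenized]
# Unit-length vectors: the dot product is the cosine similarity.
similarities = [dot(vec, vectors[-1]) for vec in vectors[:-1]]
print([round(s, 3) for s in similarities])  # [0.139, 0.159, 0.687]
```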

Or use Jaccard similarity with CountVectorizer, which I think is closer to what you expect:

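from sklearn.metrics import jaccard_score
from sklearn.feature_extraction.text import CountVectorizer

# Raw token counts; every token occurs at most once per document here,
# so each row is a binary vector that jaccard_score accepts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
arr = X.toarray()

# Jaccard = shared tokens / distinct tokens across both documents
# ("i" is dropped by the default tokenizer)
jaccard_score(arr[0], arr[3])  # gives ~0.167 (1 shared of 6)
jaccard_score(arr[1], arr[3])  # gives 0.2 (1 shared of 5)
jaccard_score(arr[2], arr[3])  # gives ~0.667 (2 shared of 3)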
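Since the question starts from token lists, Jaccard similarity can also be computed on them directly with plain sets, without any vectorizer. This is only a sketch. The `jaccard` helper is a hypothetical name, and unlike the vectorizer-based version it keeps the single-character token "i", so the scores differ.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    # Treat two empty token lists as having zero similarity.
    return len(a & b) / len(union) if union else 0.0


list1 = [['i', 'love', 'machine', 'learning', 'its', 'awesome'],
         ['i', 'love', 'coding', 'in', 'python'],
         ['i', 'love', 'building', 'chatbots']]
list2 = ['i', 'love', 'chatbots']

# One score per sublist of list1, each compared against list2.
scores = [round(jaccard(tokens, list2), 3) for tokens in list1]
print(scores)  # [0.286, 0.333, 0.75]
```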