Question

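I am trying to compute tf-idf for selected words in a corpus, but it does not work when I use a regular expression for one of the selected words.

Below is an example I copied from another Stack Overflow question, with a few small changes to reflect my problem.

The code is pasted below. It works if I write 'chocolate' and 'chocolates' as separate keywords, but it does not work if I write 'chocolate|chocolates'.

Can someone help me understand why and suggest a solution?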

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

keywords = ['tim tam', 'jam', 'fresh milk', 'chocolate|chocolates', 'biscuit pudding']
corpus = {1: "making chocolate biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
tfidf = TfidfVectorizer(vocabulary=keywords, stop_words='english', ngram_range=(1, 3))
tfs = tfidf.fit_transform(corpus.values())
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+
feature_names = tfidf.get_feature_names_out()
corpus_index = list(corpus)
# Print every non-zero (keyword, document id) pair with its tf-idf score
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])
# One row per document, one column per keyword
tfidf_results = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index).T

I expect the result to be:

('biscuit pudding', 1) 0.652490884512534
('chocolates', 1) 0.3853716274664007
('chocolate', 1) 0.652490884512534
('chocolates', 2) 0.5085423203783267
('tim tam', 2) 0.8610369959439764
('chocolates', 3) 0.5085423203783267
('fresh milk', 3) 0.8610369959439764

However, it currently returns:

('biscuit pudding', 1) 1.0
('tim tam', 2) 1.0
('fresh milk', 3) 1.0

Answer 1

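I guess you are using TfidfVectorizer from scikit-learn. If you read the documentation carefully, nowhere does it say that regular expressions can be used in the vocabulary. Can you point to the question you copied this from, if it mentions that?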
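
To see why the pattern never matches, it may help to look at the n-grams the vectorizer actually produces. The following is a minimal sketch, assuming a recent scikit-learn and reusing the settings from the question; the variable names are illustrative only. Vocabulary entries are compared to these n-grams as literal strings, so an entry containing '|' has nothing to match against.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same tokenization settings as in the question, but without a fixed vocabulary.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
analyze = vectorizer.build_analyzer()

doc = "making chocolates drink different way using fresh milk egg"
ngrams = analyze(doc)  # list of 1-, 2- and 3-grams as plain strings

# The regex-like keyword is treated as a literal string, which never occurs.
has_literal_pattern = "chocolate|chocolates" in ngrams
has_plural = "chocolates" in ngrams
print(has_literal_pattern, has_plural)  # False True
```

The default token pattern only keeps word characters, so '|' can never appear inside a token or n-gram.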

If you want to group several terms together manually, you can pass a mapping as the vocabulary instead of an iterable. For example:

keywords = {'tim tam': 0, 'jam': 1, 'fresh milk': 2, 'chocolate': 3, 'chocolates': 3, 'biscuit pudding': 4}

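Note how chocolate and chocolates both map to the same index. Caveat: scikit-learn's vocabulary validation may reject repeated indices with a ValueError, so verify this against your installed version before relying on it.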
