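Question: Why does sklearn's TfidfVectorizer attach scores to values that do not exist (i.e., the vectorizer creates empty rows)? Also, why do the scores not line up with the proper attributes?
Pipeline: pull text data from a SQL DB, split the text into bigrams, compute the per-document frequency and the per-document tf-idf of each bigram, then load the results back into the SQL DB.
Current state:
Two columns of data (number, text) are pulled in. The text is cleaned to produce a third column, cleanText: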
   number  text                               cleanText
0  123     The farmer plants grain            farmer plants grain
1  234     The farmer and his son go fishing  farmer son go fishing
2  345     The fisher catches tuna            fisher catches tuna
Rows whose cleanText contains only a single word are dropped:
data = data[data['cleanText'].str.contains(' ')]
Group by number, then perform feature extraction:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data_grouped = data.groupby('number')
word_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
tfidf_vectorizer = TfidfVectorizer()
nGrams = pd.DataFrame()
for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['cleanText'])   # bigram counts
    Y = tfidf_vectorizer.fit_transform(group['cleanText'])  # tf-idf scores
    frequencies = sum(X).toarray()[0]
    Y.todense()
    tfidfscore = Y.toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    results2 = pd.DataFrame(tfidfscore, columns=['tfidfscore'])
    # get_feature_names_out() requires scikit-learn >= 1.0
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names_out(), columns=['nGram'])
    dfinner['id'] = id
    results = results.join(dfinner)
    results = results2.join(results)  # index-based join
    nGrams = pd.concat([nGrams, results])
print(nGrams)
Output:
   tfidfscore  frequency  nGram           id
0  0.57735     1.0        farmer plants   123.0
1  0.57735     1.0        plants grain    123.0
2  0.57735     NaN        NaN             NaN
0  0.50000     1.0        farmer son      234.0
1  0.50000     1.0        go fishing      234.0
2  0.50000     1.0        son go          234.0
3  0.50000     NaN        NaN             NaN
0  0.57735     1.0        catches tuna    345.0
1  0.57735     1.0        fisher catches  345.0
2  0.57735     NaN        NaN             NaN
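Problem:
The tfidfscore values do not seem to match up. It seems the 0.5 score should be associated with number (id) 123 and number 345, since each of those rows contains two bigrams (i.e., each bigram would carry 0.5, or 50%, of the importance).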
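Why does the TfidfVectorizer add these rows and assign the scores to the wrong numbers? Does it have something to do with the index? Any and all insight would be greatly appreciated. Thanks!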
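Answer 0 (score: 0)
This was a simple problem that I had overlooked. The TfidfVectorizer was never initialized with the parameters it needs to work as intended. So I simply changed this line: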
tfidf_vectorizer = TfidfVectorizer()
to this:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')