TfidfVectorizer adds empty rows and assigns scores incorrectly

Time: 2017-05-15 13:59:46

Tags: python pandas scikit-learn tf-idf

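Question: Why does sklearn's TfidfVectorizer attach scores to values that do not exist (i.e. the vectorizer creates empty rows)? Also, why do the scores not match the appropriate attributes?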

Pipeline: pull text data from a SQL DB, split the text into bigrams, compute each bigram's frequency and tf-idf per document, and load the results back into the SQL DB.

Current state:

Bring in two columns of data (number, text). Clean the text to produce a third column, cleanText:

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna

Remove the rows whose cleanText contains only a single word:

data = data[data['cleanText'].str.contains(' ')]
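As a small illustration of that filter, here is a sketch on a toy frame. The extra row with number 456 is hypothetical and is only there to show a single-word document being dropped; `regex=False` and `na=False` are optional additions that the original line does not use.

```python
import pandas as pd

# Toy data modelled on the question; row 456 is a made-up single-word document.
data = pd.DataFrame({
    "number": [123, 234, 345, 456],
    "cleanText": [
        "farmer plants grain",
        "farmer son go fishing",
        "fisher catches tuna",
        "tuna",
    ],
})

# A row needs at least one space (two words) to yield at least one bigram.
# regex=False matches the space literally; na=False treats missing text as "no match".
filtered = data[data["cleanText"].str.contains(" ", regex=False, na=False)]
print(filtered)
```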

Group, then perform feature extraction:

data_grouped = data.groupby('number')

word_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
tfidf_vectorizer = TfidfVectorizer()

nGrams = pd.DataFrame()

for id, group in data_grouped:
       X = word_vectorizer.fit_transform(group['cleanText'])
       Y = tfidf_vectorizer.fit_transform(group['cleanText'])
       frequencies = sum(X).toarray()[0]
       Y.todense()
       tfidfscore = Y.toarray()[0]
       results = pd.DataFrame(frequencies, columns=['frequency'])
       results2 = pd.DataFrame(tfidfscore, columns=['tfidfscore'])
       dfinner = pd.DataFrame(word_vectorizer.get_feature_names(), columns=['nGram'])
       dfinner['id'] = id
       results = results.join(dfinner)
       results = results2.join(results)
       nGrams = nGrams.append(results)


print(nGrams)

Output:

   tfidfscore  frequency           nGram     id
0     0.57735        1.0   farmer plants  123.0
1     0.57735        1.0    plants grain  123.0
2     0.57735        NaN             NaN    NaN
0     0.50000        1.0      farmer son  234.0
1     0.50000        1.0      go fishing  234.0
2     0.50000        1.0          son go  234.0
3     0.50000        NaN             NaN    NaN
0     0.57735        1.0    catches tuna  345.0
1     0.57735        1.0  fisher catches  345.0
2     0.57735        NaN             NaN    NaN

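Problems:

  1. The output includes new rows in which every column except tfidfscore is null.
  2. The tfidfscore values do not seem to match. It seems the 0.5 score should be associated with number (id) 123 and number 345, since each of those rows contains two bigrams (i.e. each bigram would carry 0.5, or 50%, of the importance).

Why does TfidfVectorizer add these rows and assign the scores to the wrong numbers? Does it have something to do with the index? Any and all insight would be greatly appreciated. Thanks!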

1 Answer:

Answer 0 (score: 0)

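This was a simple problem that I overlooked. The TfidfVectorizer was never initialized with the right parameters to work as intended: with its defaults it scores single words rather than bigrams, so its columns did not line up with those of the bigram CountVectorizer. So I just changed this line: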

tfidf_vectorizer = TfidfVectorizer()

to this:

tfidf_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
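
To see why matching the parameters removes the NaN rows, here is a minimal sketch of the loop with both vectorizers sharing one set of bigram settings. It is not the original poster's final code: the names `BIGRAM_PARAMS`, `ordered_features` and `bigram_table` are hypothetical. It also assumes a recent pandas and scikit-learn, where `DataFrame.append` and `get_feature_names()` are, as far as I know, no longer available, so it uses `pd.concat` and the fitted `vocabulary_` mapping instead.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Shared settings so both vectorizers build the same bigram vocabulary.
BIGRAM_PARAMS = dict(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2), analyzer="word")


def ordered_features(vectorizer):
    """Return feature names in column order, read from the fitted vocabulary_."""
    return sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def bigram_table(data):
    """Build one row per (number, bigram) with its frequency and tf-idf score."""
    frames = []
    for number, group in data.groupby("number"):
        counts = CountVectorizer(**BIGRAM_PARAMS)
        tfidf = TfidfVectorizer(**BIGRAM_PARAMS)
        X = counts.fit_transform(group["cleanText"])
        Y = tfidf.fit_transform(group["cleanText"])
        # Same settings and same text give identical columns, so nothing is left unmatched.
        assert ordered_features(counts) == ordered_features(tfidf)
        frames.append(pd.DataFrame({
            "id": number,
            "nGram": ordered_features(counts),
            "frequency": np.asarray(X.sum(axis=0)).ravel(),
            # As in the question, take the first (here: only) document of the group.
            "tfidfscore": Y.toarray()[0],
        }))
    return pd.concat(frames, ignore_index=True)


data = pd.DataFrame({
    "number": [123, 234, 345],
    "cleanText": ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"],
})
nGrams = bigram_table(data)
print(nGrams)
```

Note that because each group is fitted on its own single document, every idf weight is 1 and the scores come purely from the default L2 normalization, i.e. 1/sqrt(k) for k distinct features. That explains the 0.57735 (three unigrams) and 0.5 (four unigrams) in the question's output. With bigrams, a two-bigram document would score about 0.7071 per bigram rather than 0.5. If a 50/50 split is what is wanted, passing `norm='l1'` to TfidfVectorizer should give that.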