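I have a DataFrame with several columns that contain string values. Computing TF-IDF on these columns returns a list of arrays, which I can map back onto the DataFrame, but each value is then an array (effectively a multi-valued cell), which makes further computation very difficult.
I would like to map these arrays to their features instead (something like an expanded DataFrame), so that I can place the result directly in the original DataFrame.
How can I achieve this?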
Sample data:
print(d1['Keywords'])
1 APS17P, auditing standards, attestation standa...
2 APS17P, auditing standards, attestation standa...
3 AAMAAM17P, SAS No. 131, SAS No. 132, CPE, Audi...
4 AAMAAM17P, SAS No. 131, SAS No. 132, CPE, Audi...
5 APT13PHI, AICPA Professional Standards, Techni...
6 005184wz, 005184, 005186HI, 005187HI, 005188HI...
7 PAOCBOA, Special purpose framework, SPF, finan...
8 PAOCBOA, Special purpose framework, SPF, finan...
9 PAOCBOA, Special purpose framework, SPF, finan...
10 ATTNPO, Not-for-profit financial statements, N...
11 ATTNPO, Not-for-profit financial statements, N...
Answer 0 (score: 0)
Here is what you need to do:
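import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
# 1. Apply TF-IDF to the text column (d1['Keywords'] in the question)
x = v.fit_transform(d1['Keywords'])
# 2. Convert the sparse TF-IDF matrix to a DataFrame, one column per term.
#    Reuse d1's index so the rows line up in step 3.
#    get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out(), index=d1.index)
# 3. Concatenate the TF-IDF DataFrame to the original one, column-wise
res = pd.concat([d1, df1], axis=1)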
For a detailed walkthrough, see my answer here: Append tfidf to pandas dataframe