I have a pandas DataFrame with 3 columns: key1, key2, document. All three columns are text fields, and the documents range from 50 to 5000 characters. For each (key1, key2) group I identify a vocabulary, defined as the words whose frequency across that group's set of documents meets a minimum threshold, using scikit-learn's CountVectorizer with min_df. I can do this with df.groupby(['key1','key2'])['document'].apply(vocab).reset_index(), where vocab is a function in which I compute the vocabulary (as defined above) and return it as a set.
Now I want to use these vocabularies (one per (key1, key2)) to filter the corresponding documents, so that each document contains only words from its vocabulary. I would appreciate any help with this part.
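For context, a minimal sketch of what such a vocab helper could look like, assuming CountVectorizer with min_df as described in the question (the min_df=2 threshold and the function body are illustrative assumptions; only the name vocab comes from the question):

from sklearn.feature_extraction.text import CountVectorizer

def vocab(documents, min_df=2):
    """Return the set of words appearing in at least `min_df` documents of a group."""
    cv = CountVectorizer(min_df=min_df)   # min_df: assumed document-frequency threshold
    cv.fit(documents)                     # documents: the group's Series of strings
    return set(cv.vocabulary_)            # keys of vocabulary_ are the retained words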
Example data
Input
key1 | key2 | document
aa | bb | He went home that evening. Then he had soup for dinner.
aa | bb | We want to sit down and eat dinner
cc | mm | Sometimes people eat in a restaurant
aa | bb | The culinary skills of that chef are terrible. Let us not go there.
cc | mm | People go home after dinner and try to sleep.
Vocabulary - not using counts for the purpose of this example
key1 | key2 | vocab
aa | bb | {went, evening, sit, down, culinary, chef, dinner}
cc | mm | {people, restaurant, home, dinner, sleep}
Result - each document keeps only the words from its corresponding vocab
key1 | key2 | document
aa | bb | went evening dinner
aa | bb | sit down dinner
cc | mm | people restaurant
aa | bb | culinary chef
cc | mm | people home dinner sleep
Answer 0 (score: 0)
You can first add the vocab column to the original DataFrame with merge:
import re
import pandas as pd

# build one vocabulary (a set of words) per (key1, key2) group
df2 = df.groupby(['key1', 'key2'])['document'].apply(vocab).reset_index(name='vocab')
# attach the matching vocabulary to every row of the original DataFrame
df = pd.merge(df, df2, on=['key1', 'key2'], how='left')
# another theoretical solution
# df['vocab'] = df.groupby(['key1', 'key2'])['document'].transform(vocab)
Then extract all the words with str.findall; instead of relying on re.I, the documents are lower-cased first so that capitalised words actually match the lower-case vocabulary:
# lower-case, then split each document into a list of words
df['document'] = df['document'].str.lower().str.findall(r'\w+')
Finally take the intersection of each word list with the set in vocab, convert back to a string with str.join, and drop the vocab column:
# intersect each word list with its vocabulary; note that a set intersection
# does not preserve the original word order or duplicates
df['document'] = df.apply(lambda x: set(x['document']) & x['vocab'], axis=1).str.join(' ')
df = df.drop('vocab', axis=1)
print(df)
key1 key2 document
0 aa bb evening went dinner
1 aa bb sit down dinner
2 cc mm restaurant people
3 aa bb chef culinary
4 cc mm home people sleep dinner
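Note that the set intersection drops duplicate words and does not preserve their original order (which is why, for example, the first row reads evening went dinner). If order matters, an alternative sketch is to filter each raw document on the merged frame, i.e. run this instead of the findall/intersection steps above:

# run on the merged frame that still has the raw documents and the vocab column
df['document'] = df.apply(
    lambda x: ' '.join(w for w in re.findall(r'\w+', x['document'].lower())
                       if w in x['vocab']),
    axis=1)
df = df.drop('vocab', axis=1)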