groupby and apply across two DataFrames

Time: 2018-01-23 06:48:01

Tags: python pandas pandas-groupby

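I have a pandas DataFrame with three columns: key1, key2, and document. All three are text fields, and document ranges in size from 50 to 5000 characters. For each (key1, key2) group, I build a vocabulary from the words that meet a minimum frequency in that group's set of documents, using scikit-learn's CountVectorizer with min_df set. I can do this with df.groupby(['key1','key2'])['document'].apply(vocab).reset_index(), where vocab is a function that computes the vocabulary (as defined above) and returns it as a set.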
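
The vocab function itself is not shown in the question. As a minimal sketch of what it might look like, the block below uses a plain collections.Counter as a stand-in for CountVectorizer(min_df=...), keeping every word that occurs in at least min_df documents of a group. The threshold min_df=2, the \w+ tokenizer, the lowercasing, and the name vocab_df are all assumptions made for illustration, not the asker's actual code.

```python
import re
from collections import Counter

import pandas as pd


def vocab(docs, min_df=2):
    """Return the set of lowercase words found in at least `min_df` documents.

    Stand-in for CountVectorizer(min_df=...); min_df=2 is an assumed value.
    """
    doc_freq = Counter()
    for text in docs:
        # Count each word once per document (document frequency, not term count).
        doc_freq.update(set(re.findall(r"\w+", text.lower())))
    return {word for word, n in doc_freq.items() if n >= min_df}


df = pd.DataFrame({
    "key1": ["aa", "aa", "cc", "aa", "cc"],
    "key2": ["bb", "bb", "mm", "bb", "mm"],
    "document": [
        "He went home that evening. Then he had soup for dinner.",
        "We want to sit down and eat dinner",
        "Sometimes people eat in a restaurant",
        "The culinary skills of that chef are terrible.  Let us not go there.",
        "People go home after dinner and try to sleep.",
    ],
})

# One vocabulary set per (key1, key2) group, stored in a column named 'vocab'.
vocab_df = (
    df.groupby(["key1", "key2"])["document"]
      .apply(vocab)
      .reset_index(name="vocab")
)
```

Passing name="vocab" to reset_index keeps the result column from being called document, which would otherwise clash with the original column in a later merge.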

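Now I want to use these vocabularies (one per key1, key2 pair) to filter the corresponding documents, so that each document contains only words from its vocabulary. I would appreciate any help with this part.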

Sample data

Input

key1 | key2 | document
 aa  | bb   | He went home that evening. Then he had soup for dinner.
 aa  | bb   | We want to sit down and eat dinner
 cc  | mm   | Sometimes people eat in a restaurant
 aa  | bb   | The culinary skills of that chef are terrible.  Let us not go there.
 cc  | mm   | People go home after dinner and try to sleep.


Vocabulary - not using counts for the purpose of this example

key1 | key2 | vocab
 aa  | bb   | {went, evening, sit, down, culinary, chef, dinner}
 cc  | mm   | {people, restaurant, home, dinner, sleep}

Result - only use words from corresponding vocab in document

key1 | key2 | document
 aa  | bb   | went evening dinner
 aa  | bb   | sit down dinner
 cc  | mm   | people restaurant
 aa  | bb   | culinary chef
 cc  | mm   | people home dinner sleep

1 answer:

Answer 0 (score: 0)

You can first use merge to add the vocab column to the original DataFrame:

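import re
import pandas as pd

# Build one vocabulary per (key1, key2) group; name the result column 'vocab'
df2 = df.groupby(['key1','key2'])['document'].apply(vocab).reset_index(name='vocab')
# Attach each group's vocabulary to every row of that group
df = pd.merge(df, df2, on=['key1','key2'], how='left')

#another theoretical solution
#df['vocab'] = df.groupby(['key1','key2'])['document'].transform(vocab)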
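
To illustrate how the left merge spreads one group-level row over many document rows, here is a small self-contained sketch. The frames docs and vocabs and their contents are hypothetical toy data. The validate="many_to_one" argument is an optional safety check, not part of the original answer: it makes pandas raise an error if a key pair appears more than once in the vocabulary table.

```python
import pandas as pd

# Row-level documents: several rows may share the same (key1, key2) pair.
docs = pd.DataFrame({
    "key1": ["aa", "aa", "cc"],
    "key2": ["bb", "bb", "mm"],
    "document": ["went home", "sit down", "people eat"],
})

# Group-level vocabularies: exactly one row per (key1, key2) pair (toy values).
vocabs = pd.DataFrame({
    "key1": ["aa", "cc"],
    "key2": ["bb", "mm"],
    "vocab": [{"went", "sit", "down"}, {"people"}],
})

# how="left" keeps every document row; validate checks the right-hand keys are unique.
merged = docs.merge(vocabs, on=["key1", "key2"], how="left", validate="many_to_one")
```

After the merge, both aa/bb rows carry the same vocabulary set, so the filtering step can work row by row.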

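Then extract all words with str.findall. The re.I flag does not change what \w+ matches, so lowercase the text first to make matching against the lowercase vocabulary case-insensitive. The vocab column is dropped at the end.

df['document'] = df['document'].str.lower().str.findall(r'\w+')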

Finally, take the intersection of each word set with its vocabulary and convert the result to a string with str.join:

# Set intersection: word order is arbitrary and repeated words are collapsed
df['document'] = df.apply(lambda x: set(x['document']) & x['vocab'], axis=1).str.join(' ')
df = df.drop('vocab', axis=1)
print(df)
  key1 key2                  document
0   aa   bb       evening went dinner
1   aa   bb           sit down dinner
2   cc   mm         restaurant people
3   aa   bb             chef culinary
4   cc   mm  home people sleep dinner
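
The set intersection above discards word order and repeated words, whereas the expected result in the question keeps the words in document order. If that matters, one possible alternative is to filter the token list directly. The sketch below assumes the vocabularies are lowercase sets already merged into a vocab column; the helper keep_vocab_words and the filtered column are hypothetical names, not part of the original answer.

```python
import re

import pandas as pd

# Two rows from the sample data, with the vocab column already merged in.
df = pd.DataFrame({
    "document": [
        "He went home that evening. Then he had soup for dinner.",
        "People go home after dinner and try to sleep.",
    ],
    "vocab": [
        {"went", "evening", "sit", "down", "culinary", "chef", "dinner"},
        {"people", "restaurant", "home", "dinner", "sleep"},
    ],
})


def keep_vocab_words(text, allowed):
    """Keep the words of `text` that are in `allowed`, in their original order."""
    words = re.findall(r"\w+", text.lower())
    return " ".join(w for w in words if w in allowed)


# Pair each document with its own vocabulary and filter row by row.
df["filtered"] = [
    keep_vocab_words(text, allowed)
    for text, allowed in zip(df["document"], df["vocab"])
]
```

Unlike the set-based version, this keeps a word as many times as it occurs in the document.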