I have two dataframes. df1 is the result of a groupby, i.e. df.groupby('keyword'):
df1
keyword string
A "This is a test string for the example"
"This is also a test string based on the other string"
"This string is a test string based on the other strings"
B "You can probably guess that this is also a test string"
"Yet again, another test string"
"This is also a test"
and df2, which is an empty dataframe. I also have a list of specific values:
keyword_list = ['string', 'test']
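Basically, I want to count how often each word in keyword_list occurs in the string column of df1. For each keyword in df1, the word with the highest frequency should then be written to a specific column of the new dataframe, so that 'A' in df2 is assigned the word that occurs most often in df1's string column for that keyword.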
So ideally, since 'string' is the word that occurs most often in the rows of df1 under keyword A, A is assigned string, and so on.
df2
keyword High_freq_word
A "string"
B "test"
Let me know if this needs clarification or if it makes sense!
Update:
@anky_91 provided some great code, but the output is a bit awkward:
df['matches'] = df.description.str.findall('|'.join(keyword_list))
df.groupby(odf.Type.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))
which gives you
df1
keyword string
A "This is a test string for the example"
"This is also a test string based on the other string"
"This string is a test string based on the other strings"
B "You can probably guess that this is also a test string"
"Yet again, another test string"
"This is also a test"
but it adds a new column:
matches
['string','test']
['test', 'string', 'string']
[etc...]
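I could find a way to convert these to numeric values and then assign the value to that column, but the bigger problem is appending this new column to the new dataframe.
Since this is a groupby with many repeated values, I am trying to find a Pythonic way to map the "most frequent word" to the keyword itself, rather than to the whole mode based on the keyword list.
Answer 0 (score: 3)
As I understand it, you can do the following: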
from itertools import chain
from scipy.stats import mode
keyword_list = ['string', 'test']
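df['matches'] = df.string.str.findall('|'.join(keyword_list))  # find all matches in each row
# forward-fill the keyword so every row has a group label, then take the most common match per group
df.groupby(df.keyword.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))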
keyword
A string
B test
Name: matches, dtype: object
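As far as I know, recent SciPy releases restrict scipy.stats.mode to numeric input, so the call above may fail on lists of strings depending on your version. Here is a minimal sketch of the same idea using only pandas and collections.Counter, which also writes the result into a new dataframe with the column names from the question. It assumes df1 has the columns keyword and string, with the blank keyword cells stored as NaN; the helper most_common_word is a made-up name for illustration.

```python
from collections import Counter
from itertools import chain

import numpy as np
import pandas as pd

keyword_list = ['string', 'test']

# Sample data from the question; blank keyword cells are assumed to be NaN.
df1 = pd.DataFrame({
    'keyword': ['A', np.nan, np.nan, 'B', np.nan, np.nan],
    'string': [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})

# One list of matched words per row (substring matches, so "strings" counts as "string").
matches = df1['string'].str.findall('|'.join(keyword_list))


def most_common_word(lists):
    """Hypothetical helper: most frequent word across several lists of matches."""
    counts = Counter(chain.from_iterable(lists))
    return counts.most_common(1)[0][0] if counts else None


# Group the match lists by the forward-filled keyword and keep one word per keyword.
df2 = (
    matches.groupby(df1['keyword'].ffill())
    .apply(most_common_word)
    .rename('High_freq_word')
    .reset_index()
)

print(df2)
```

On ties, Counter.most_common returns the word that was seen first, so you may want an explicit tie-breaking rule if that matters for your data.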