Consider the following DataFrame:
In [2]: import pandas as pd
In [3]: df1 = pd.DataFrame({'col1':['John', 'Felix', 'Vicki', 'Sam', 'Jack', 'Rodney'],
'col2': ['Likes tea with cookies', 'Likes tea with croissants','Likes coffee with churros',
'Likes tea with muffins','Likes beer with chicken wings','Likes coffee with donuts']})
In [4]: df1
Out[4]:
     col1                           col2
0    John         Likes tea with cookies
1   Felix      Likes tea with croissants
2   Vicki      Likes coffee with churros
3     Sam         Likes tea with muffins
4    Jack  Likes beer with chicken wings
5  Rodney       Likes coffee with donuts
When I call value_counts() on col2, I get the count of each string in the Series. As expected, every string is unique and appears only once, so each count is 1:
In [5]: df1['col2'].value_counts()
Out[5]:
Likes coffee with churros 1
Like tea with muffins 1
Likes tea with croissants 1
Likes coffee with donuts 1
Likes beer with chicken wings 1
Likes tea with cookies 1
Name: col2, dtype: int64
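What I want to do: aggregate the value_counts() so that strings sharing a similar substring (for example Likes tea with.. and Likes coffee with..) are counted together, with output like this: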
Likes coffee with 2
Likes tea with 3
Likes beer with 1
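One column of my DataFrame contains many rows with similar strings that differ only slightly. I have been trying to group the rows that share a substring and return value_counts() for those groups, together with the counts of the other strings in the column.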
My attempt: I can get the number of occurrences of a substring like this:
In [14]: df1['col2'].str.lower().str.count("likes tea with").sum()
Out[14]: 2
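But this only gives me a separate count for one specific substring.
Question: How can I get all the counts in a single output, with combined counts for similar-looking strings (as in my example) alongside the counts of all other strings?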
Answer 0 (score: 0)
You can split the column into words, drop the last word of each string, and then apply value_counts:
df1.col2 = df1.col2.replace('Like ', 'Likes ', regex = True)
df1['col2'].str.split().str[:-1].apply(' '.join).value_counts()
Likes tea with 3
Likes coffee with 2
Likes beer with chicken 1
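Note that dropping only the last word leaves "Likes beer with chicken" rather than the "Likes beer with" asked for in the question. If the shared prefix can be assumed to always be the first three words (an assumption for this sketch, not something the question guarantees), keeping a fixed number of leading words is one possible variant. The names prefix and counts are illustrative, and the DataFrame is the one defined in the question:

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['John', 'Felix', 'Vicki', 'Sam', 'Jack', 'Rodney'],
    'col2': ['Likes tea with cookies', 'Likes tea with croissants',
             'Likes coffee with churros', 'Likes tea with muffins',
             'Likes beer with chicken wings', 'Likes coffee with donuts'],
})

# Assumption: the common prefix is always exactly the first three words.
# Split each string on whitespace, keep the first three words, re-join them.
prefix = df1['col2'].str.split().str[:3].str.join(' ')

# Count how many rows share each three-word prefix.
counts = prefix.value_counts()
print(counts)
```

With this data the counts are 3 for "Likes tea with", 2 for "Likes coffee with" and 1 for "Likes beer with". If the prefixes vary in length, a fixed word count will not be enough and the grouping rule would need to be defined differently.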