Total count of strings in a Series that contain a substring

Time: 2019-02-27 19:03:45

Tags: pandas

Consider the following DataFrame:

In [2]: import pandas as pd

In [3]: df1 = pd.DataFrame({'col1':['John', 'Felix', 'Vicki', 'Sam', 'Jack', 'Rodney'], 
'col2': ['Likes tea with cookies', 'Likes tea with croissants','Likes coffee with churros',
'Likes tea with muffins','Likes beer with chicken wings','Likes coffee with donuts']})

In [4]: df1
Out[4]: 
     col1                           col2
0    John         Likes tea with cookies
1   Felix      Likes tea with croissants
2   Vicki      Likes coffee with churros
3     Sam          Likes tea with muffins
4    Jack  Likes beer with chicken wings
5  Rodney       Likes coffee with donuts

When I call value_counts() on col2, I get the count of each string in the Series. As expected, every string is unique, so each one has a count of 1:

In [5]: df1['col2'].value_counts()
Out[5]: 
Likes coffee with churros        1
Likes tea with muffins           1
Likes tea with croissants        1
Likes coffee with donuts         1
Likes beer with chicken wings    1
Likes tea with cookies           1
Name: col2, dtype: int64

What I want to do: aggregate the value_counts() for strings that share a similar substring (e.g. Likes tea with..., Likes coffee with...) and display output like this:

Likes coffee with     2
Likes tea with        3
Likes beer with       1

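One column of my DataFrame contains many rows with similar strings that differ only slightly. I have been trying to group the rows that share a substring and return the value_counts() for those groups together with the counts of the other strings in the column.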

My attempt: I can get the number of occurrences of a substring like this:

In [14]: df1['col2'].str.lower().str.count("likes tea with").sum()
Out[14]: 3

But this only gives me a single count for one specific substring.

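Question: how can I get all the counts in one output, with a combined count for similar-looking strings (as in my example) alongside the counts of all the other strings?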

1 Answer:

Answer 0: (score: 0)

You can split each string in the column into words, drop the last word, and apply value_counts to the result:

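# Normalise any "Like " to "Likes " so that the prefixes match
df1.col2 = df1.col2.replace('Like ', 'Likes ', regex=True)
# Split on whitespace, drop the last word, re-join, then count the resulting prefixes
df1['col2'].str.split().str[:-1].apply(' '.join).value_counts()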

Likes tea with             3
Likes coffee with          2
Likes beer with chicken    1
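
Dropping only the last word leaves "Likes beer with chicken" for the two-word item "chicken wings". If the shared part is always the first three words (an assumption based on the sample data, not something the question guarantees), a variant that keeps a fixed number of leading words may come closer to the requested output. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['John', 'Felix', 'Vicki', 'Sam', 'Jack', 'Rodney'],
    'col2': ['Likes tea with cookies', 'Likes tea with croissants',
             'Likes coffee with churros', 'Likes tea with muffins',
             'Likes beer with chicken wings', 'Likes coffee with donuts'],
})

# Assumption: the common prefix is exactly the first three words of each string.
n_words = 3

# Split on whitespace, keep the first n_words tokens, and join them back together.
prefix = df1['col2'].str.split().str[:n_words].str.join(' ')

# Count how many rows share each prefix.
counts = prefix.value_counts()
print(counts)
```

With the sample data this counts "Likes tea with" three times, "Likes coffee with" twice and "Likes beer with" once. If the prefixes vary in length, a fixed slice will not be enough and the grouping rule would have to be defined some other way.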