Question

I have a DataFrame, and I want to count words in one specific column across the entire DataFrame.

Suppose shape is a column in the DataFrame:

shape                             color
circle rectangle                  orange
square triangle 
rombus  



square oval                       black
triangle circle

rectangle oval                    white
triangle

I want to count how many times circle, rectangle, oval, and triangle appear in the shape column of the DataFrame.

The output should be:

circle    2
rectangle 2
triangle  3
oval      2

Answer 1

Use:

L = ['circle','rectangle','oval','triangle']

s = df['shape'].astype(str).str.split(expand=True).stack()
df = s[s.isin(L)].value_counts().reindex(L, fill_value=0).reset_index()
df.columns = ['vals','counts']
print (df)
        vals  counts
0     circle       2
1  rectangle       2
2       oval       2
3   triangle       3

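Explanation:

First split on whitespace (the default separator), then stack to get a Series of words
Filter with isin against the values in the list
To count, use value_counts
If necessary, use reindex to change the order or to add missing values as 0
To get a DataFrame from the Series, add reset_index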
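
The steps above can be traced one at a time. This is a minimal sketch, assuming a sample frame rebuilt from the question's data; the None row standing in for the blank rows is an assumption, and the names words, kept, counts, and out are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question; the None row stands in for a blank row (assumption).
df = pd.DataFrame({'shape': ['circle rectangle', 'square triangle', 'rombus', None,
                             'square oval', 'triangle circle',
                             'rectangle oval', 'triangle']})
L = ['circle', 'rectangle', 'oval', 'triangle']

# 1. Split on whitespace into one column per word, then stack into a single Series of words.
words = df['shape'].astype(str).str.split(expand=True).stack()

# 2. Keep only the words listed in L; everything else, including whatever
#    the missing row turned into, is dropped here.
kept = words[words.isin(L)]

# 3. Count the words, then 4. force the order of L and fill absent shapes with 0.
counts = kept.value_counts().reindex(L, fill_value=0)

# 5. Turn the Series into a two-column DataFrame.
out = counts.rename_axis('vals').reset_index(name='counts')
print(out)
```

Printing words and kept before the final step is a quick way to check which tokens survive the filter.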

Answer 2

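You can join the 'shape' column with spaces and split the result. Pass that to the top-level function pandas.value_counts and use reindex to subset it to the shapes you want to see.

The advantage of reindex is that if one of the desired shapes is not in the 'shape' column, it returns NaN for it.

shapes = ['circle','rectangle','oval','triangle']

pd.value_counts(' '.join(df['shape']).split()).reindex(shapes)

circle       2
rectangle    2
oval         2
triangle     3
dtype: int64

If you expect that shapes may be missing from the dataset, you can also give reindex a fill value, such as the 0 that Answer 1 passes as fill_value.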
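
A sketch of the fill-value variant, assuming 0 as the fill value and a hypothetical 'hexagon' entry added only to show what happens to a shape that never occurs. It calls Series.value_counts instead of the top-level pandas.value_counts, because recent pandas releases (2.1 and later, as far as I know) deprecate the top-level function. The dropna call is also an addition, to guard against blank rows like those in the question.

```python
import pandas as pd

df = pd.DataFrame({'shape': ['circle rectangle', 'square triangle', 'rombus',
                             'square oval', 'triangle circle',
                             'rectangle oval', 'triangle']})

# 'hexagon' is a hypothetical shape that never occurs, added to show the fill value.
shapes = ['circle', 'rectangle', 'oval', 'triangle', 'hexagon']

# Join every row into one string, split it back into words, and count them.
# dropna() is an assumption: it skips missing rows, which str.join cannot handle.
words = pd.Series(' '.join(df['shape'].dropna()).split())
counts = words.value_counts()

with_nan = counts.reindex(shapes)                  # missing shape becomes NaN
with_zero = counts.reindex(shapes, fill_value=0)   # missing shape becomes 0
print(with_zero)
```

With the fill value the counts stay integers; without it the NaN forces the result to a float dtype.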

Answer 3

After splitting the strings, you can use collections.Counter together with itertools.chain:

import pandas as pd
from collections import Counter
from itertools import chain

df = pd.DataFrame({'shape': ['circle rectangle', 'square triangle',
                             'rombus', 'square oval', 'triangle circle',
                             'rectangle oval', 'triangle']})

c = Counter(chain.from_iterable(df['shape'].str.split()))

print(c)

Counter({'triangle': 3, 'circle': 2, 'rectangle': 2,
         'square': 2, 'oval': 2, 'rombus': 1})

This gives a Counter object, which is a subclass of dict. If you want to filter the keys, you can do so with a dictionary comprehension:

L = {'circle', 'rectangle', 'oval', 'triangle'}

res = {k: v for k, v in c.items() if k in L}

print(res)

{'circle': 2, 'oval': 2, 'rectangle': 2, 'triangle': 3}
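
Because a Counter returns 0 for a key it has never seen, the filter can also iterate over the wanted shapes rather than over the counter. That keeps the requested order and reports absent shapes as 0. A minimal sketch, assuming the same sample data; the list wanted and the 'hexagon' entry are illustrative and not part of the answer.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({'shape': ['circle rectangle', 'square triangle',
                             'rombus', 'square oval', 'triangle circle',
                             'rectangle oval', 'triangle']})

# Count every word across all rows.
c = Counter(chain.from_iterable(df['shape'].str.split()))

# 'hexagon' is a hypothetical shape that never occurs in the data.
wanted = ['circle', 'rectangle', 'oval', 'triangle', 'hexagon']

# Looking up a missing key in a Counter yields 0 and does not insert the key.
res = {k: c[k] for k in wanted}
print(res)
# {'circle': 2, 'rectangle': 2, 'oval': 2, 'triangle': 3, 'hexagon': 0}
```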

How do I count all selected words that match a condition in a DataFrame?
