Question

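I have a dataframe and a for loop driven by a dictionary that defines how to handle specific column names, carried over from a previous question: Pandas Generating dataframe based on columns being present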

import pandas as pd

df = pd.DataFrame({
    'Players': ['Sam', 'Greg', 'Steve', 'Sam', 'Greg',
                'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
    'Wins': [10, 5, 5, 20, 30, 20, 6, 9, 3, 10],
    'Losses': [5, 5, 5, 2, 3, 2, 16, 20, 3, 12],
    'Type': ['A', 'B', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B'],
})

# Group once; every summary below is computed per player.
p = df.groupby('Players')

sumdict = {'Total Games': (None, 'count'),
           'Average Wins': ('Wins', 'mean'),
           'Greatest Wins': ('Wins', 'max'),
           'Unique games': ('Type', 'nunique'),
           'Max Score': ('Score', 'max')}

summary = []
for key, (column, op) in sumdict.items():
    if column is None:
        res = p.agg(op).max(axis=1)
    elif column not in df:
        continue
    else:
        res = p[column].agg(lambda x: getattr(x, op)())
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)

The code works for almost every case, except for an apply call that counts a specific condition within a column:

streak = pd.DataFrame({'Streak':p.Wins.apply(lambda x: (x > 5).sum())})

Is there a way to incorporate the apply function into the sumdict dictionary?

Answer 1

You have a couple of options.

1. Check whether the operation is a function and, if so, use it instead of getattr.
2. Just use strings, and let functions pass through...
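Option 1 could look roughly like the sketch below. The helper name apply_op is hypothetical (it is not part of pandas or of the question's code), and the dataframe is trimmed to the two columns the example needs.

```python
import pandas as pd

df = pd.DataFrame({
    'Players': ['Sam', 'Greg', 'Steve', 'Sam', 'Greg',
                'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
    'Wins': [10, 5, 5, 20, 30, 20, 6, 9, 3, 10],
})
p = df.groupby('Players')


def apply_op(grouped_column, op):
    """Hypothetical helper: dispatch on whether op is a function or a method name."""
    if callable(op):
        # A function is applied to each group's Series directly.
        return grouped_column.agg(op)
    # A string is looked up as a Series method, as in the question's loop.
    return grouped_column.agg(lambda x: getattr(x, op)())


greatest = apply_op(p['Wins'], 'max')
streak = apply_op(p['Wins'], lambda x: (x > 5).sum())
```

Inside the question's loop, the else branch would then become res = apply_op(p[column], op).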

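IMO option 2 is a bit cleaner (though perhaps less well known?): agg accepts a method name as a string, so g.agg("max") is an alias for g.max(), and it accepts a function just as well. Add the Streak entry to the dictionary: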

sumdict["Streak"] = "Wins", lambda x: (x > 5).sum()

Then run the loop as follows; the commented line is the only change:

summary = []
for key, (column, op) in sumdict.items():
    if column is None:
        res = p.agg(op).max(axis=1)
    elif column not in df:
        continue
    else:
        res = p[column].agg(op)  # just use the string (or it could be a func)
    summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)

Streak then works perfectly:

In [23]: summary
Out[23]:
         Greatest Wins  Total Games  Streak  Average Wins  Unique games
Players
Greg                30            4       2            11             2
Sam                 20            2       2            15             2
Steve               20            4       3            11             2
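The reason a single agg(op) call covers both kinds of entry can be checked on a small example. The four-row dataframe below is made up for illustration and is not the one from the question.

```python
import pandas as pd

# Toy data, assumed for this sketch only.
df = pd.DataFrame({'Players': ['Sam', 'Greg', 'Sam', 'Greg'],
                   'Wins': [10, 5, 20, 30]})
g = df.groupby('Players')['Wins']

by_name = g.agg('max')      # string: resolved to the groupby method of that name
by_method = g.max()         # the same method called directly
by_callable = g.agg(lambda x: (x > 5).sum())  # function: applied to each group

same = by_name.equals(by_method)
```

Here same should be True, and by_callable holds the per-player count of wins above 5, so string entries and function entries in sumdict can share one code path.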
