Question

I have more experience with R's data.table, but I am trying to learn pandas. In data.table, I can do something like this:

> head(dt_m)
   event_id           device_id longitude latitude               time_ category
1:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
2:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
3:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
4:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
5:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
6:  1004583 -100015673884079572        NA       NA 1970-01-01 06:34:52   1 free
                 app_id is_active
1: -5305696816021977482         0
2: -7164737313972860089         0
3: -8504475857937456387         0
4: -8807740666788515175         0
5:  5302560163370202064         0
6:  5521284031585796822         0


dt_m_summary <- dt_m[,
                     .(
                       mean_active = mean(is_active, na.rm = TRUE)
                       , median_lat = median(latitude, na.rm = TRUE)
                       , median_lon = median(longitude, na.rm = TRUE)
                       , mean_time = mean(time_)
                       , new_col = your_function(latitude, longitude, time_)
                     )
                     , by = list(device_id, category)
                     ]

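The new columns (mean_active through new_col), along with device_id and category, will appear in dt_m_summary. If I want a new column holding the result of the groupby-apply, I can also do a similar by-transform in the original table: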

dt_m[, mean_active := mean(is_active, na.rm = TRUE), by = list(device_id, category)]

(in case I wanted, for example, to select the rows where mean_active is greater than some threshold, or do something else).

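I know pandas has groupby, but I haven't found a way to do the kind of simple transformations shown above. The best I could come up with was a series of groupby-apply calls whose results are then merged into one dataframe, but that seems very clunky. Is there a better way?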

Answer 1

IIUC, use groupby and agg. See the docs for more information.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 2),
                  pd.MultiIndex.from_product([list('XY'), range(5)]),
                  list('AB'))

df

df.groupby(level=0).agg(['sum', 'count', 'std'])

A more targeted example would be

# level=0 means group by the first level in the index
# if there is a specific column you want to group by
# use groupby('specific column name')
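# nested-dict renaming was removed in pandas 1.0, so label the
# lambda with a (name, function) tuple instead
df.groupby(level=0).agg({'A': ['sum', 'std'],
                         'B': [('my_function', lambda x: x.sum() ** 2)]})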

Note that the dict passed to the agg method has keys 'A' and 'B'. This means: run the functions ['sum', 'std'] on 'A', and run lambda x: x.sum() ** 2 on 'B' (labeling the result 'my_function').
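To make the shape of the result concrete, here is a minimal sketch on a small deterministic frame. The values are made up for illustration and are not from the question; the flattening step at the end is one possible convention, not something the answer prescribes.

```python
import pandas as pd

# Small deterministic frame (illustrative values, not from the question)
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]},
    index=pd.MultiIndex.from_product([list("XY"), range(2)]),
)

# Dict keys pick columns; each value lists the reductions to run on that column
result = df.groupby(level=0).agg({"A": ["sum", "std"], "B": ["sum"]})

# The columns come back as a two-level MultiIndex of (column, function name)
print(result.columns.tolist())

# Optional: flatten to single-level names such as "A_sum"
flat = result.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```

Flattening is handy when the summary is going to be merged back onto another table, since single-level names are easier to reference.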

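Note 2, regarding your new_column: agg requires that each function you pass reduce a column to a scalar. You are better off adding the new column before the groupby / agg.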
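A rough sketch of that two-step approach, assuming pandas 0.25 or later for named aggregation. The dt_m frame below is a hypothetical stand-in with made-up values, and lat_plus_lon is an assumed placeholder for whatever row-wise quantity your_function would compute.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dt_m; values are made up for illustration
dt_m = pd.DataFrame({
    "device_id": [1, 1, 2, 2],
    "category": ["a", "a", "b", "b"],
    "latitude": [10.0, 20.0, np.nan, 40.0],
    "longitude": [1.0, 3.0, 5.0, 7.0],
    "is_active": [0, 1, 1, 1],
})

# Step 1: compute the row-wise quantity as an ordinary column (assumed formula)
dt_m["lat_plus_lon"] = dt_m["latitude"] + dt_m["longitude"]

# Step 2: reduce each column per group; keyword names become output columns
summary = (
    dt_m.groupby(["device_id", "category"])
    .agg(
        mean_active=("is_active", "mean"),
        median_lat=("latitude", "median"),
        median_lon=("longitude", "median"),
        new_col=("lat_plus_lon", "mean"),
    )
    .reset_index()
)
print(summary)
```

The built-in reductions skip NaN by default, which roughly matches na.rm = TRUE in the data.table version.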

Answer 2

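@piRSquared has a great answer, but in your particular case I think you might be interested in pandas' very flexible apply function. Because it is applied to each group one at a time, you can operate on multiple columns of the grouped DataFrame simultaneously.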

def your_function(sub_df):
    return np.mean(np.cos(sub_df['latitude']) + np.sin(sub_df['longitude']) - np.tan(sub_df['time_']))

def group_function(g):
    # Return one labeled value per summary column for this group
    return pd.Series({
        'mean_active': g['is_active'].mean(),
        'median_lat': g['latitude'].median(),
        'median_lon': g['longitude'].median(),
        'mean_time': g['time_'].mean(),
        'new_col': your_function(g),
    })

dt_m.groupby(['device_id', 'category']).apply(group_function)

However, I definitely agree with @piRSquared that it would be very helpful to see a full example including the expected output.
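For the in-place `:=` case from the question (a per-group value repeated on every original row), groupby followed by transform may be a closer match than apply. The following is a minimal sketch; the dt_m frame is a hypothetical stand-in with made-up values, and the 0.6 threshold is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for dt_m; values are made up for illustration
dt_m = pd.DataFrame({
    "device_id": [1, 1, 2, 2, 2],
    "category": ["a", "a", "b", "b", "b"],
    "is_active": [0, 1, 1, 1, 0],
})

# transform returns one value per original row, aligned to dt_m's index
dt_m["mean_active"] = (
    dt_m.groupby(["device_id", "category"])["is_active"].transform("mean")
)

# Follow-up filter: keep rows whose group mean exceeds an arbitrary threshold
active_rows = dt_m[dt_m["mean_active"] > 0.6]
print(active_rows)
```

Because the result keeps the original row count, no merge back onto dt_m is needed.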

pandas: how to do multiple groupby-apply operations
