Question

I have to convert some code from R to Python.

In R, using dplyr, we do the following:

df %>%
group_by(col_a, col_b) %>%
summarise( a = sum(col_c == 'a'),
      b = sum(col_c == 'b'),
      c = b/a
)

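Looking at some older answers, they suggest using the apply method and wrapping our requirements in a function. Writing a function each time is rather slow, especially when we have to try out several new columns while experimenting.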

Can we do something in pandas that is similar to the R example I gave, and in a similar way?

I have actually tried something similar, but pandas is much slower (about 1 second, versus 200 ms for dplyr):

Just an example:

df.groupby('id').agg({'out':[lambda x:sum(x==4)]})

I was able to make it faster by filtering the dataset before grouping and aggregating:

df.assign(out=df.out==4).groupby('id').agg({'out':sum})

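However, this takes away the freedom to apply multiple filters and compare them in a single line of code. That is, I cannot filter on df.out == 4 and df.out == 3 and so on in one line, put the results into variables, and then go on to compute ratios or sums of the two.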

I have tried a lot of Google searches but have not found an answer.

Answer 1

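Try the approach below. I once faced the same problem you are facing now. It is a bit verbose, but it is fast and does everything in one go. It also gives you the freedom to get really fancy if you need to :). Hope it helps!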

# basic imports
import numpy as np
import pandas as pd

df_summarized = (
    df.assign(  # create the columns you want to summarize before grouping, using 'assign'
        out_four=np.where(df.out == 4, 1, 0),
        out_three=np.where(df.out == 3, 1, 0),
    )
    .groupby(['A', 'B'])
    .agg(  # named aggregation; the string 'sum' avoids the warning newer pandas emits for np.sum
        total=('out', 'sum'),
        four=('out_four', 'sum'),
        three=('out_three', 'sum'),
    )
    .assign(  # create more custom columns (e.g. ratios) based on the output of the aggregation
        four_by_three=lambda x: x.four / x.three,
        four_by_total=lambda x: x.four / x.total,
        three_by_total=lambda x: x.three / x.total,
        # you can also get really fancy and try to add columns like these
        three_normalized=lambda x: (x.three - x.three.mean()) / x.three.std(),
        four_perc_contribution=lambda x: x.four / x.four.sum(),
        total_over_A_total=lambda x: x.total / x.groupby('A').total.transform('sum'),
    )
)
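
For reference, here is a minimal sketch of the same assign / groupby / agg / assign pattern applied to the dplyr example from the question. The toy data is made up purely for illustration, and the helper column names is_a and is_b are hypothetical; only col_a, col_b and col_c come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question's dplyr example.
df = pd.DataFrame({
    "col_a": ["x", "x", "x", "y", "y", "y"],
    "col_b": ["p", "p", "p", "q", "q", "q"],
    "col_c": ["a", "b", "b", "a", "a", "b"],
})

summary = (
    df.assign(
        # Precompute one boolean column per condition (vectorized, no lambdas per group).
        is_a=df["col_c"].eq("a"),
        is_b=df["col_c"].eq("b"),
    )
    .groupby(["col_a", "col_b"])
    # Summing booleans counts the True values, like sum(col_c == 'a') in R.
    .agg(a=("is_a", "sum"), b=("is_b", "sum"))
    # Derived columns can refer to the aggregated ones, like c = b/a in summarise().
    .assign(c=lambda d: d["b"] / d["a"])
    .reset_index()
)
print(summary)
```

Adding another condition only takes one more line in the first assign and one more entry in agg, which keeps experimentation fairly cheap. Note that a group with a == 0 would give inf or NaN in c rather than raising an error.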

Answer 2

It's as simple as this:

from datar.all import f, group_by, summarise, sum

df >> \
  group_by(f.col_a, f.col_b) >> \
  summarise(a = sum(f.col_c == 'a'),
            b = sum(f.col_c == 'b'),
            c = f.b/f.a
  )

I am the author of the datar package.

R's group_by -> filter -> pandas equivalent for rapid prototyping?