How do I combine iterating over each column with groupby in pandas?

Time: 2018-10-29 15:02:21

Tags: python r pandas

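I have a data frame with about 100 columns. In R, I can replace each factor in columns 2-100 with the mean of column 1 associated with that factor, using the following code: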

tmp <- NULL
for (i in seq(2, 100, 1)) {
  tmp[[i]] <- df %>%
    group_by(df[[i]]) %>%
    mutate(mean = mean(column1)) %>%
    ungroup()
}

An example with a simple data frame:

df1:    
Column1     Column2
10          dog 
11          dog 
9           dog 
1           cat 
2           cat 
3           cat

would become:

df2:
Column1    Column2
10         10
11         10
 9         10
 1          2
 2          2
 3          2

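My question is how to accomplish this in Python. I have tried various combinations using the dfply package, but could not get it to iterate over every column and then output a data frame with the same dimensions as the starting one.
Thanks, Keith.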

3 Answers:

Answer 0: (score: 0)

In [19]: df
Out[19]:    Column1 Column2
         0       10     dog
         1       11     dog
         2        9     dog
         3        1     cat
         4        2     cat
         5        3     cat
In [20]: df['Column2'] = df.groupby('Column2')['Column1'].transform('mean')
In [21]: df
Out[21]:    Column1  Column2
         0       10       10
         1       11       10
         2        9       10
         3        1        2
         4        2        2
         5        3        2

To iterate over the columns, you can do the following (iterating over a DataFrame yields its column names):

for g in df:
    # Put your code here
    print(g)

Column1
Column2
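The two pieces above (transform and the loop over columns) can be combined. The following is only a minimal sketch, not part of the original answer: the helper name replace_with_group_means is made up for illustration, and it assumes the value column is the one named "Column1".

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [10, 11, 9, 1, 2, 3],
    "Column2": ["dog", "dog", "dog", "cat", "cat", "cat"],
})

def replace_with_group_means(frame, value_col):
    """Hypothetical helper: return a copy of `frame` in which every column
    except `value_col` holds the mean of `value_col` for that row's group."""
    out = frame.copy()
    for col in frame.columns:
        if col == value_col:
            continue
        # Group by the original labels in `frame`, not by values already replaced in `out`
        out[col] = frame.groupby(col)[value_col].transform("mean")
    return out

result = replace_with_group_means(df, "Column1")
print(result)
```

The result has the same shape as the input, which is what the question asks for. Reading from frame while writing to out keeps each grouping independent of the columns already processed.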

Answer 1: (score: 0)

By combining @Alex's suggestion of "transform" with a few tricks of my own, I was able to solve my problem as follows:

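df1:    
Column1     Column2    Column3
10              dog     square
11              dog     square
 9              dog     square
 1              cat     circle
 2              cat     circle
 3              cat     circle

import pandas as pd

# df is the data frame shown above as df1
means = []  # one Series of group means per grouping column
for i in range(1, 3):  # column positions 1 and 2 (Column2 and Column3)
    tmp = df.groupby(df.iloc[:, i])["Column1"].transform('mean')
    # transform names its result "Column1", so restore the grouping column's name
    means.append(tmp.rename(df.columns[i]))
# Put Column1 back in front so the result has the same shape as df
dfnew = pd.concat([df["Column1"]] + means, axis=1)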

The output should be:

dfnew:    
Column1     Column2    Column3
10              10          10
11              10          10
 9              10          10
 1               2           2
 2               2           2
 3               2           2
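The same result can probably be had without an explicit loop. As a sketch only (this is an alternative to the loop above, not what this answer originally used), DataFrame.apply can pass each label column to a function that groups Column1 by those labels:

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [10, 11, 9, 1, 2, 3],
    "Column2": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "Column3": ["square", "square", "square", "circle", "circle", "circle"],
})

# For each label column, group Column1 by that column's labels and
# broadcast the group mean back to every row
group_means = df.iloc[:, 1:].apply(
    lambda labels: df["Column1"].groupby(labels).transform("mean")
)

# Reattach the untouched value column in front
dfnew = pd.concat([df[["Column1"]], group_means], axis=1)
print(dfnew)
```

Whether this is clearer than the loop is a matter of taste; with about 100 columns either form should be fine.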

Answer 2: (score: 0)

You don't need a for loop to do this; across can handle multiple columns.

In R:

library(dplyr) 
df1 = tribble(    
    ~Column1,  ~Column2,  ~Column3,  ~column4,  
    10,        "dog",     "square",  "pizza",  
    11,        "dog",     "square",  "pizza",  
    9,         "dog",     "circle",  "pizza",  
    1,         "cat",     "circle",  "pizza",  
    2,         "cat",     "circle",  "pie",  
    3,         "cat",     "circle",  "pie",  
)  
    
df1 %>% mutate(  
    across( 
        # columns other than Column1 
        -Column1,  
        # calculate the mean based on current column 
        ~ tibble(Column1=Column1, x=.x) %>%  
            group_by(x) %>%  
            mutate(x=mean(Column1)) %>%  
            pull(x) 
    )  
)
# A tibble: 6 x 4
  Column1 Column2 Column3 column4
    <dbl>   <dbl>   <dbl>   <dbl>
1      10      10   10.5     7.75
2      11      10   10.5     7.75
3       9      10    3.75    7.75
4       1       2    3.75    7.75
5       2       2    3.75    2.5 
6       3       2    3.75    2.5

You can do something similar in Python with datar:

>>> from datar.all import f, tribble, mutate, across, group_by, mean, pull
>>> 
>>> df1 = tribble(  
...     f.Column1, f.Column2, f.Column3, f.column4,
...     10,        "dog",     "square",  "pizza",
...     11,        "dog",     "square",  "pizza",
...     9,         "dog",     "circle",  "pizza",
...     1,         "cat",     "circle",  "pizza",
...     2,         "cat",     "circle",  "pie",
...     3,         "cat",     "circle",  "pie",
... )
>>> 
>>> df1 >> mutate(
...     across(
...         ~f.Column1, 
...         lambda x: group_by(df1, x) >> mutate(x=mean(f.Column1)) >> pull(f.x)
...     )
... )
   Column1   Column2   Column3   column4
   <int64> <float64> <float64> <float64>
0       10      10.0     10.50      7.75
1       11      10.0     10.50      7.75
2        9      10.0      3.75      7.75
3        1       2.0      3.75      7.75
4        2       2.0      3.75      2.50
5        3       2.0      3.75      2.50

Of course, you can also use a for loop:

>>> from datar.all import ungroup
>>> dfnew = df1
>>> for col in df1.columns[1:]:
...     dfnew = dfnew >> group_by(col) >> mutate(**{col: mean(f.Column1)})
... 
>>> dfnew >> ungroup()
   Column1   Column2   Column3   column4
   <int64> <float64> <float64> <float64>
0       10      10.0     10.50      7.75
1       11      10.0     10.50      7.75
2        9      10.0      3.75      7.75
3        1       2.0      3.75      7.75
4        2       2.0      3.75      2.50
5        3       2.0      3.75      2.50

I am the author of the datar package. If you have any questions, feel free to submit an issue.
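For comparison, here is a plain-pandas sketch of the same computation on the four-column example. It is not part of the datar answer; it swaps in a different technique, building a per-label lookup with groupby(...).mean() and mapping the labels through it.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Column1": [10, 11, 9, 1, 2, 3],
    "Column2": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "Column3": ["square", "square", "circle", "circle", "circle", "circle"],
    "column4": ["pizza", "pizza", "pizza", "pizza", "pie", "pie"],
})

dfnew = df1.copy()
for col in df1.columns[1:]:
    # Small lookup Series: label -> mean of Column1 for that label
    lookup = df1.groupby(col)["Column1"].mean()
    # Replace each label with its group mean
    dfnew[col] = df1[col].map(lookup)

print(dfnew)
```

On this data the values agree with the tables above (10.5 and 3.75 for Column3, 7.75 and 2.5 for column4).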