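I have a data frame with about 100 columns. In R, I can replace each factor in columns 2-100 with the mean of column 1 for that factor, using the following code:
tmp <- list()
for (i in seq(2, 100, 1)) {
  tmp[[i]] <- df %>% group_by(df[[i]]) %>% mutate(mean = mean(column1)) %>%
    ungroup()
}
An example with a simple data frame: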
df1:
Column1 Column2
10 dog
11 dog
9 dog
1 cat
2 cat
3 cat
would become:
df2:
Column1 Column2
10 10
11 10
9 10
1 2
2 2
3 2
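My question is how to do this in Python. I tried various combinations with the dfply package, but could not get it to loop over each column and then output a data frame with the same dimensions as the starting one.
Thanks,
Keith.
Answer 0 (score: 0)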
In [19]: df
Out[19]: Column1 Column2
0 10 dog
1 11 dog
2 9 dog
3 1 cat
4 2 cat
5 3 cat
In [20]: df['Column2'] = df.groupby('Column2')['Column1'].transform('mean')
In [21]: df
Out[21]: Column1 Column2
0 10 10
1 11 10
2 9 10
3 1 2
4 2 2
5 3 2
To iterate over the columns, you can do the following:
for g in df:
    # Put your code here
    print(g)
Column1
Column2
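Putting the two pieces of this answer together, the `transform('mean')` call can be applied inside the column loop. The following is only a minimal sketch: the helper name `replace_with_group_means` and its `value_col` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [10, 11, 9, 1, 2, 3],
    "Column2": ["dog", "dog", "dog", "cat", "cat", "cat"],
})

def replace_with_group_means(frame, value_col="Column1"):
    """Hypothetical helper: return a copy in which every column except
    value_col is replaced by the mean of value_col within that column's groups."""
    out = frame.copy()
    for col in frame.columns:
        if col == value_col:
            continue  # leave the value column untouched
        # Group on the original labels so earlier replacements cannot interfere
        out[col] = frame.groupby(col)[value_col].transform("mean")
    return out

result = replace_with_group_means(df)
print(result)
```

Because `transform` returns one value per input row, the result keeps the same number of rows and columns as the starting data frame.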
Answer 1 (score: 0)
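Combining @Alex's suggestion to use "transform" with a few tricks of my own, I was able to solve my problem as follows:
df:
Column1 Column2 Column3
10 dog square
11 dog square
9 dog square
1 cat circle
2 cat circle
3 cat circle
import pandas as pd

means = [df["Column1"]]  # start with the value column; "means" avoids shadowing the built-in list
for i in range(1, df.shape[1]):  # every column after Column1
    tmp = df.groupby(df.iloc[:, i])["Column1"].transform("mean")
    means.append(tmp.rename(df.columns[i]))  # keep the original column name
dfnew = pd.DataFrame(means).T  # each Series becomes a row, so transpose back
The output should be: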
dfnew:
Column1 Column2 Column3
10 10 10
11 10 10
9 10 10
1 2 2
2 2 2
3 2 2
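As a possible variation on the same idea, the intermediate list and the transpose can be avoided by building all replacement columns in a dict comprehension and passing them to `DataFrame.assign`. This is a sketch on the three-column example above; the variable name `group_cols` is just an illustrative choice, and it assumes every column name is a string.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [10, 11, 9, 1, 2, 3],
    "Column2": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "Column3": ["square", "square", "square", "circle", "circle", "circle"],
})

# Every column after the first is treated as a grouping column
group_cols = df.columns[1:]

# assign() returns a new frame; each grouping column is overwritten
# with the per-group mean of Column1
dfnew = df.assign(**{
    col: df.groupby(col)["Column1"].transform("mean") for col in group_cols
})
print(dfnew)
```

Here `Column1` keeps its original integer dtype, while the replaced columns come back as floats.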
Answer 2 (score: 0)
You don't need a for loop to do this; across can handle multiple columns.
In R:
library(dplyr)
df1 = tribble(
~Column1, ~Column2, ~Column3, ~column4,
10, "dog", "square", "pizza",
11, "dog", "square", "pizza",
9, "dog", "circle", "pizza",
1, "cat", "circle", "pizza",
2, "cat", "circle", "pie",
3, "cat", "circle", "pie",
)
df1 %>% mutate(
across(
# columns other than Column1
-Column1,
# calculate the mean based on current column
~ tibble(Column1=Column1, x=.x) %>%
group_by(x) %>%
mutate(x=mean(Column1)) %>%
pull(x)
)
)
# A tibble: 6 x 4
Column1 Column2 Column3 column4
<dbl> <dbl> <dbl> <dbl>
1 10 10 10.5 7.75
2 11 10 10.5 7.75
3 9 10 3.75 7.75
4 1 2 3.75 7.75
5 2 2 3.75 2.5
6 3 2 3.75 2.5
You can do something similar in Python with datar:
>>> from datar.all import f, tribble, mutate, across, group_by, mean, pull
>>>
>>> df1 = tribble(
... f.Column1, f.Column2, f.Column3, f.column4,
... 10, "dog", "square", "pizza",
... 11, "dog", "square", "pizza",
... 9, "dog", "circle", "pizza",
... 1, "cat", "circle", "pizza",
... 2, "cat", "circle", "pie",
... 3, "cat", "circle", "pie",
... )
>>>
>>> df1 >> mutate(
... across(
... ~f.Column1,
... lambda x: group_by(df1, x) >> mutate(x=mean(f.Column1)) >> pull(f.x)
... )
... )
Column1 Column2 Column3 column4
<int64> <float64> <float64> <float64>
0 10 10.0 10.50 7.75
1 11 10.0 10.50 7.75
2 9 10.0 3.75 7.75
3 1 2.0 3.75 7.75
4 2 2.0 3.75 2.50
5 3 2.0 3.75 2.50
Of course, you can also use a for loop:
>>> from datar.all import ungroup
>>> dfnew = df1
>>> for col in df1.columns[1:]:
...     dfnew = dfnew >> group_by(col) >> mutate(**{col: mean(f.Column1)})
...
>>> dfnew >> ungroup()
Column1 Column2 Column3 column4
<int64> <float64> <float64> <float64>
0 10 10.0 10.50 7.75
1 11 10.0 10.50 7.75
2 9 10.0 3.75 7.75
3 1 2.0 3.75 7.75
4 2 2.0 3.75 2.50
5 3 2.0 3.75 2.50
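For comparison, the same loop-free, column-wise computation can also be sketched in plain pandas, without datar. This is an assumption-laden illustration rather than an equivalent of `across`: it applies a function to each non-value column with `DataFrame.apply` and groups `Column1` by that column's labels.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Column1": [10, 11, 9, 1, 2, 3],
    "Column2": ["dog", "dog", "dog", "cat", "cat", "cat"],
    "Column3": ["square", "square", "circle", "circle", "circle", "circle"],
    "column4": ["pizza", "pizza", "pizza", "pizza", "pie", "pie"],
})

values = df1["Column1"]

# For each remaining column, group Column1 by that column's labels and
# broadcast the group mean back to every row
means = df1.drop(columns="Column1").apply(
    lambda keys: values.groupby(keys).transform("mean")
)

# Put the untouched value column back in front
result = pd.concat([values, means], axis=1)
print(result)
```

On this data the replaced columns contain the same means as the tables above: 10 and 2 for Column2, 10.5 and 3.75 for Column3, and 7.75 and 2.5 for column4.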
I am the author of the datar package. If you have any questions, feel free to submit an issue.