I have input data in a data frame like the following.
id priority owner goal changed_time delta_time
1 P1 bob 40 2016-11-02 20
2 P2 bob 20 2016-11-02 10
3 P3 bob 30 2016-11-02 50
4 P1 alice 20 2016-10-02 70
5 P1 bob 40 2016-10-02 05
6 P1 bob 40 2016-10-02 24
7 P3 alice 40 2016-09-02 34
8 P1 bob 40 2016-09-02 20
9 P2 ross 40 2016-09-02 10
10 P1 bob 40 2016-11-02 20
11 P2 sec 40 2016-09-02 34
12 P3 bob 30 2016-11-02 90
I want output like the table below. For the left-hand part I am currently grouping with:
df[['owner','priority','goal','delta_time']].groupby(['owner','priority']).mean()
But I want the output to look like this:
(Average of delta_time on a monthly basis)
owner priority goal 2016-11 2016-10 2016-09
bob p1 40
p2 20
p3 30
alice p1 20
p3 40
ross p2 40
sec p2 40
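So how can I apply the mean function to the delta_time column so that it depends on the changed_time column, with one column per month as in the table above? I got as far as the left-hand part by grouping with:
df[['owner','priority','goal','delta_time']].groupby([df.changed_time.dt.month,'owner','priority']).mean()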
Answer 0 (score: 0)
I would do something like this:
df.groupby(['priority','owner',pd.PeriodIndex(data=df.changed_time, freq='M')]).mean().unstack()
To get the desired format, I think you need to do two separate groupbys and then concat:
df1 = df.groupby(['priority','owner'])['goal'].mean()
# You can also do: pd.DatetimeIndex(data=df.changed_time).month
df2 = df.groupby(['priority','owner',pd.PeriodIndex(data=df.changed_time, freq='M')])['delta_time'].mean().unstack()
pd.concat([df1, df2], axis=1)
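The two-groupby-then-concat idea above can be sketched end to end as follows. This is a minimal sketch, not necessarily what the answerer ran: it assumes the sample data from the question is loaded with `read_csv`, and it labels the months with `strftime("%Y-%m")` strings instead of a `PeriodIndex` so the resulting column names are plain strings. The names `raw`, `month`, `goal`, `monthly` and `result` are illustrative only.

```python
import io

import pandas as pd

# Sample data copied from the question (whitespace separated).
raw = """id priority owner goal changed_time delta_time
1 P1 bob 40 2016-11-02 20
2 P2 bob 20 2016-11-02 10
3 P3 bob 30 2016-11-02 50
4 P1 alice 20 2016-10-02 70
5 P1 bob 40 2016-10-02 05
6 P1 bob 40 2016-10-02 24
7 P3 alice 40 2016-09-02 34
8 P1 bob 40 2016-09-02 20
9 P2 ross 40 2016-09-02 10
10 P1 bob 40 2016-11-02 20
11 P2 sec 40 2016-09-02 34
12 P3 bob 30 2016-11-02 90"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+", parse_dates=["changed_time"])

# Assumption: year-month labels as strings such as "2016-11".
month = df["changed_time"].dt.strftime("%Y-%m")

# First groupby: mean goal per (owner, priority).
goal = df.groupby(["owner", "priority"])["goal"].mean()

# Second groupby: mean delta_time per (owner, priority, month),
# then move the month level into the columns.
monthly = df.groupby(["owner", "priority", month])["delta_time"].mean().unstack()

# Both results share the (owner, priority) index, so they align column-wise.
result = pd.concat([goal, monthly], axis=1)
print(result)
```

Combinations with no rows in a given month come out as NaN, which matches the NA cells in the R outputs further down.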
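Answer 1 (score: 0)
Because the question carries multiple language tags, it is unclear which language the answer should be in. In general, a question needs to be focused and should deal with only one language. That may be why the question has drawn close votes.
In any case, if you want an answer in R, here are some alternatives:
1) reshape2: Add a year/month column ("time") and reshape from long to wide using mean: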
library(reshape2)
df2 <- transform(df, time = substr(changed_time, 1, 7))
dcast(df2, owner + priority + goal ~ time, mean, value.var = "delta_time", fill = NA_real_)
giving:
owner priority goal 2016-09 2016-10 2016-11
1 alice P1 20 NA 70.0 NA
2 alice P3 40 34 NA NA
3 bob P1 40 20 14.5 20
4 bob P2 20 NA NA 10
5 bob P3 30 NA NA 70
6 ross P2 40 10 NA NA
7 sec P2 40 34 NA NA
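For readers working in pandas rather than R, roughly the same long-to-wide reshape can be sketched with `pivot_table`. This is an illustrative analogue of the `dcast` call, not part of the original answer; it assumes the question's sample data and reads `changed_time` as plain text so that the first seven characters give the year-month, mirroring `substr(changed_time, 1, 7)`. The names `raw` and `wide` are made up for the example.

```python
import io

import pandas as pd

# Sample data copied from the question (whitespace separated).
raw = """id priority owner goal changed_time delta_time
1 P1 bob 40 2016-11-02 20
2 P2 bob 20 2016-11-02 10
3 P3 bob 30 2016-11-02 50
4 P1 alice 20 2016-10-02 70
5 P1 bob 40 2016-10-02 05
6 P1 bob 40 2016-10-02 24
7 P3 alice 40 2016-09-02 34
8 P1 bob 40 2016-09-02 20
9 P2 ross 40 2016-09-02 10
10 P1 bob 40 2016-11-02 20
11 P2 sec 40 2016-09-02 34
12 P3 bob 30 2016-11-02 90"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Year-month label taken from the first 7 characters of the date string.
df["time"] = df["changed_time"].str[:7]

# One row per (owner, priority, goal), one column per month,
# each cell holding the mean delta_time; missing combinations become NaN.
wide = df.pivot_table(
    index=["owner", "priority", "goal"],
    columns="time",
    values="delta_time",
    aggfunc="mean",
).reset_index()
print(wide)
```

Here `index` plays the role of the left-hand side of the `dcast` formula and `columns` the right-hand side.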
2) dplyr/tidyr: Add a year/month "time" column and compute the average delta_time by owner, priority, goal and time. Finally convert from long to wide and sort.
library(dplyr)
library(tidyr)
df %>%
  mutate(time = substr(changed_time, 1, 7)) %>%
  group_by(owner, priority, goal, time) %>%
  summarize(delta_time = mean(delta_time)) %>%
  ungroup() %>%
  spread(time, delta_time) %>%
  arrange(owner, priority)
giving:
# A tibble: 7 x 6
owner priority goal 2016-09 2016-10 2016-11
<fctr> <fctr> <int> <dbl> <dbl> <dbl>
1 alice P1 20 NA 70.0 NA
2 alice P3 40 34 NA NA
3 bob P1 40 20 14.5 20
4 bob P2 20 NA NA 10
5 bob P3 30 NA NA 70
6 ross P2 40 10 NA NA
7 sec P2 40 34 NA NA
3) No packages: Add a year/month column ("time"), use aggregate to compute the means, then use reshape to convert from long to wide and sort:
df2 <- transform(df, time = substr(changed_time, 1, 7))
ag <- aggregate(delta_time ~ owner + priority + goal + time, df2, mean)
nms <- unique(as.character(sort(ag$time)))
r <- reshape(ag, dir = "wide", idvar = c("owner", "priority", "goal"), varying = list(nms))
o <- order(r$owner, r$priority)
r[o, ]
giving:
owner priority goal 2016-09 2016-10 2016-11
5 alice P1 20 NA 70.0 NA
4 alice P3 40 34 NA NA
1 bob P1 40 20 14.5 20
7 bob P2 20 NA NA 10
8 bob P3 30 NA NA 70
2 ross P2 40 10 NA NA
3 sec P2 40 34 NA NA
Note: The input data frame df in reproducible form is:
Lines <- "id priority owner goal changed_time delta_time
1 P1 bob 40 2016-11-02 20
2 P2 bob 20 2016-11-02 10
3 P3 bob 30 2016-11-02 50
4 P1 alice 20 2016-10-02 70
5 P1 bob 40 2016-10-02 05
6 P1 bob 40 2016-10-02 24
7 P3 alice 40 2016-09-02 34
8 P1 bob 40 2016-09-02 20
9 P2 ross 40 2016-09-02 10
10 P1 bob 40 2016-11-02 20
11 P2 sec 40 2016-09-02 34
12 P3 bob 30 2016-11-02 90"
df <- read.table(text = Lines, header = TRUE)