I have a data table with values like the following:
Score ID
1.2 1
2.4 1
1.1 1
1.9 1
2.4 2
3.5 2
2.2 2
1.1 3
1.1 3
1.7 3
3.1 3
2.2 3
What I actually need is to compute the change in score within each ID, in a separate column, so that it looks like this:
Score ID Changes
1.2 1
2.4 1 1.2
1.1 1 -1.3
1.9 1 0.8
2.4 2
3.5 2 1.1
2.2 2 -1.3
1.1 3
1.1 3 0
1.7 3 0.6
3.1 3 1.4
2.2 3 -0.9
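Note that each change is computed by subtracting the previous score from the current score: for the first change you get 1.2, because 1.2 = 2.4 (current score) - 1.2 (previous score).
The problem is that a change may only be computed between scores that share the same ID, so you cannot simply iterate over the whole column and take successive differences. How can I do this? What logic can I use to make sure each change stays within a single ID? I am comfortable doing this in R or Python (or Bash).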
Answer 0 (score: 5)
In R, perhaps:
transform(
df,
Changes = ave(df$Score, df$ID, FUN = function(x) c(NA, diff(x)))
)
# Score ID Changes
# 1 1.2 1 NA
# 2 2.4 1 1.2
# 3 1.1 1 -1.3
# 4 1.9 1 0.8
# 5 2.4 2 NA
# 6 3.5 2 1.1
# 7 2.2 2 -1.3
# 8 1.1 3 NA
# 9 1.1 3 0.0
# 10 1.7 3 0.6
# 11 3.1 3 1.4
# 12 2.2 3 -0.9
df itself remains your original data frame, since transform returns a modified copy.
Regarding @Bulat's comment, here are the timings on the small 12-row example:
id<- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
score <- c(1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2)
library(data.table)
df <- data.frame(id, score)
dt = data.table(id, score)
library(microbenchmark)
microbenchmark(
dt = dt[ , diff := c(NA, diff(score)), by = id],
df = df$diff <- ave(df$score, df$id, FUN = function(x) c(NA, diff(x))),
times = 1000
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# dt 1121.931 1225.2660 1342.4626 1269.5530 1321.2210 5908.411 1000 b
# df 397.175 488.2085 547.8198 525.8175 586.7995 7784.270 1000 a
Answer 1 (score: 3)
In Python, using pandas:
import pandas as pd
df = pd.DataFrame(
{'ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
'Score': [1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2]})
df['Changes'] = df.groupby('ID').Score.transform(lambda group: group.diff())
>>> df
Score ID Changes
0 1.2 1 NaN
1 2.4 1 1.2
2 1.1 1 -1.3
3 1.9 1 0.8
4 2.4 2 NaN
5 3.5 2 1.1
6 2.2 2 -1.3
7 1.1 3 NaN
8 1.1 3 0.0
9 1.7 3 0.6
10 3.1 3 1.4
11 2.2 3 -0.9
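For comparison, the same per-ID logic can be sketched without pandas. This is only a minimal illustration, not part of the original answer: the function name changes_within_id is hypothetical, and it assumes the rows are (score, id) tuples already ordered so that equal IDs are adjacent, as in the question's table.

```python
from itertools import groupby
from operator import itemgetter


def changes_within_id(rows):
    """Return (score, id, change) tuples; change is None for the first row of each ID.

    Assumes rows are (score, id) tuples with equal IDs adjacent to each other.
    """
    result = []
    # groupby starts a new group every time the ID changes
    for _, group in groupby(rows, key=itemgetter(1)):
        previous = None  # reset at the start of every ID
        for score, id_ in group:
            change = None if previous is None else score - previous
            result.append((score, id_, change))
            previous = score
    return result


rows = [(1.2, 1), (2.4, 1), (1.1, 1), (1.9, 1),
        (2.4, 2), (3.5, 2), (2.2, 2),
        (1.1, 3), (1.1, 3), (1.7, 3), (3.1, 3), (2.2, 3)]

for score, id_, change in changes_within_id(rows):
    # round only for display; raw float differences can carry tiny rounding error
    shown = "" if change is None else round(change, 1)
    print(score, id_, shown)
```

Resetting previous at the start of each group is what keeps a difference from being taken across two different IDs. If the input is not ordered by ID, it would need to be sorted (or grouped with a dictionary) first.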
Answer 2 (score: 2)
In R, with data.table:
id<- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
score <- c(1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2)
library(data.table)
df = data.table(id, score)
df[ , diff := c(NA, diff(score)), by = id]
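It is worth mentioning that on a larger data set (the same 12 rows repeated 10^5 times) this is about 10 times faster than the transform approach: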
id<- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
id <- rep(id, 10^5)
score <- c(1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2)
score <- rep(score, 10^5)
library(data.table)
dt = data.table(id, score)
df = data.frame(id, score)
library(microbenchmark)
m <- microbenchmark(
"data.table" = dt[ , diff := c(NA, diff(score)), by = id],
"transform" = transform(
df,
Changes = ave(df$score, df$id, FUN = function(x) c(NA, diff(x)))
), times = 10
)
boxplot(m)
print(m)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 95.1905 100.3342 111.2434 102.6525 106.7417 151.2913 10
# transform 1032.3829 1066.3912 1078.5727 1070.9577 1103.4971 1135.7380 10
Answer 3 (score: 0)
In R, using dplyr.