How do I calculate a value specific to the ID given on each row?

Time: 2016-05-11 20:53:42

Tags: python r pandas

I have a data table with values like the following:

Score ID
1.2    1
2.4    1
1.1    1
1.9    1
2.4    2
3.5    2
2.2    2
1.1    3
1.1    3
1.7    3
3.1    3
2.2    3

What I really need is to calculate the change in score for each ID in a separate column, so that it looks like this:

Score   ID  Changes
1.2     1   
2.4     1   1.2
1.1     1   -1.3
1.9     1   0.8
2.4     2   
3.5     2   1.1
2.2     2   -1.3
1.1     3   
1.1     3   0
1.7     3   0.6
3.1     3   1.4
2.2     3   -0.9

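Note that each "Changes" value is calculated by taking the previous score and subtracting it from the current score: for the first change, you get 1.2 from 1.2 = 2.4 (current score) - 1.2 (previous score). The first row of each ID has no previous score, so its "Changes" value is left empty.

The problem is that a score may only be compared with other scores within the same ID. You can't just iterate over the rows and take the difference of consecutive scores. How should I do this? What logic can I use to keep the "Changes" calculation within the same ID? I'm familiar with R or Python (or Bash) for doing this.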
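To make the logic in question concrete, here is a minimal sketch in plain Python, with no libraries. It assumes the rows arrive in the order shown above and remembers the last score seen for each ID. The function name `changes_within_id` and the rounding to 10 decimal places (used only to hide floating-point noise) are illustrative choices, not part of the original question.

```python
def changes_within_id(rows):
    """Return one change per (score, id) row: the current score minus the
    previous score seen for the same ID, or None for an ID's first row."""
    last_score = {}  # most recent score seen so far for each ID
    changes = []
    for score, id_ in rows:
        if id_ in last_score:
            # Rounding is an assumption made here to hide float noise
            # such as -1.2999999999999998.
            changes.append(round(score - last_score[id_], 10))
        else:
            changes.append(None)  # no previous score within this ID
        last_score[id_] = score
    return changes


rows = [(1.2, 1), (2.4, 1), (1.1, 1), (1.9, 1),
        (2.4, 2), (3.5, 2), (2.2, 2),
        (1.1, 3), (1.1, 3), (1.7, 3), (3.1, 3), (2.2, 3)]

for (score, id_), change in zip(rows, changes_within_id(rows)):
    print(score, id_, "" if change is None else change)
```

Because the previous score is looked up per ID, this also works when rows of different IDs are interleaved, as long as the order within each ID is the intended one.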

4 Answers:

Answer 0: (score: 5)

In R, perhaps:

transform(
  df, 
  Changes = ave(df$Score, df$ID, FUN = function(x) c(NA, diff(x)))
)
#    Score ID Changes
# 1    1.2  1      NA
# 2    2.4  1     1.2
# 3    1.1  1    -1.3
# 4    1.9  1     0.8
# 5    2.4  2      NA
# 6    3.5  2     1.1
# 7    2.2  2    -1.3
# 8    1.1  3      NA
# 9    1.1  3     0.0
# 10   1.7  3     0.6
# 11   3.1  3     1.4
# 12   2.2  3    -0.9

Here df holds your original data frame.

Regarding @Bulat's comment, a benchmark on this small data set:

id<- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
score <-  c(1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2)
library(data.table)
df <- data.frame(id, score)
dt = data.table(id, score)
library(microbenchmark)
microbenchmark(
  dt = dt[ , diff := c(NA, diff(score)), by = id],
  df = df$diff <- ave(df$score, df$id, FUN = function(x) c(NA, diff(x))),
  times = 1000
)
# Unit: microseconds
#  expr      min        lq      mean    median        uq      max neval cld
#    dt 1121.931 1225.2660 1342.4626 1269.5530 1321.2210 5908.411  1000   b
#    df  397.175  488.2085  547.8198  525.8175  586.7995 7784.270  1000  a 

Answer 1: (score: 3)

In Python, using Pandas:

import pandas as pd

df = pd.DataFrame(
        {'ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
         'Score': [1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2]})

df['Changes'] = df.groupby('ID').Score.transform(lambda group: group.diff())

>>> df
    Score  ID  Changes
0     1.2   1      NaN
1     2.4   1      1.2
2     1.1   1     -1.3
3     1.9   1      0.8
4     2.4   2      NaN
5     3.5   2      1.1
6     2.2   2     -1.3
7     1.1   3      NaN
8     1.1   3      0.0
9     1.7   3      0.6
10    3.1   3      1.4
11    2.2   3     -0.9
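As a variation on the approach above, the grouped difference can also be written without a lambda. The sketch below assumes the same column names, `Score` and `ID`, and shows two equivalent spellings: the group-by `diff()` method, and an explicit subtraction of the previous score in each group obtained with `shift()`. The variable name `via_shift` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Score": [1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2],
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
})

# Grouped diff: current row minus the previous row within each ID.
# The first row of every ID has no predecessor and becomes NaN.
df["Changes"] = df.groupby("ID")["Score"].diff()

# The same calculation spelled out: shift() moves each group's scores
# down by one row, so the subtraction is "current minus previous".
via_shift = df["Score"] - df.groupby("ID")["Score"].shift()

# round(1) only tidies the display of floating-point results.
print(df.round(1))
```

Both forms rely on the existing row order within each ID, so the frame should be sorted first if that order is not already the intended one.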

Answer 2: (score: 2)

In R, with data.table:

id<- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
score <-  c(1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2)
library(data.table)
df = data.table(id, score)

df[ , diff := c(NA, diff(score)), by = id]

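It's worth mentioning that, on a larger data set (the 12 rows repeated 10^5 times, 1.2 million rows), this is roughly 10 times faster than the transform approach: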

id<- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
id <- rep(id, 10^5)
score <-  c(1.2, 2.4, 1.1, 1.9, 2.4, 3.5, 2.2, 1.1, 1.1, 1.7, 3.1, 2.2)
score <- rep(score, 10^5)
library(data.table)
dt = data.table(id, score)
df = data.frame(id, score)

library(microbenchmark)
m <- microbenchmark(
  "data.table" = dt[ , diff := c(NA, diff(score)), by = id],
  "transform" = transform(
    df, 
    Changes = ave(df$score, df$id, FUN = function(x) c(NA, diff(x)))
  ), times = 10
)

boxplot(m)
print(m)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval
# data.table   95.1905  100.3342  111.2434  102.6525  106.7417  151.2913    10
#  transform 1032.3829 1066.3912 1078.5727 1070.9577 1103.4971 1135.7380    10

Answer 3: (score: 0)

Using dplyr in R.