Working with a table that has repeated IDs

Date: 2015-05-01 14:00:57

Tags: r

I'm not an R beginner, but I'm really struggling with this problem. I have a data frame (this is an example):

id name dateA 
1   A   150
1   A   160
2   B   110
2   B   1009
2   B   098
2   B   309
3   C   218
3   C   310
4   D   219

I want to create 3 new columns (minA, maxA, repA):

minA == min(of dateA for each id)
maxA == max(of dateA for each id)
repA == number of rows (repetitions) for each id


id name dateA minA maxA repA
1   A   150
1   A   160
2   B   110
2   B   1009
2   B   098
2   B   309
3   C   218
3   C   310
4   D   219

Thanks for your help. I hope I was clear enough.

3 answers:

Answer 0 (score: 4)

You could try

library(data.table)#v1.9.5+
# add the per-id min, max, and row count to df1 by reference
setDT(df1)[, c('minA', 'maxA', 'repA') := list(min(dateA), max(dateA), .N),
           by = id]
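To see what the call above does on the question's sample data, here is a minimal self-contained sketch. It assumes the data.table package is installed and rebuilds `df1` with `data.frame()` instead of the `structure()` call listed under "Data".

```r
library(data.table)

# Sample data from the question, rebuilt as a plain data.frame
df1 <- data.frame(
  id    = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L),
  name  = c("A", "A", "B", "B", "B", "B", "C", "C", "D"),
  dateA = c(150L, 160L, 110L, 1009L, 98L, 309L, 218L, 310L, 219L),
  stringsAsFactors = FALSE
)

# setDT() converts df1 to a data.table in place; := then adds the three
# columns by reference, with each group's value repeated on all of its rows
setDT(df1)[, c("minA", "maxA", "repA") := list(min(dateA), max(dateA), .N),
           by = id]

print(df1)
```

Because `:=` modifies `df1` in place, there is no need to assign the result back to a new variable.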

Update

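For the updated dataset, we create the columns 'minA', 'maxA', and 'repA' as before: assign (:=) min(dateA), max(dateA), and .N, grouped by 'id'. Then set the key column to 'id' (setkey(.., id)) and join with the output obtained by reshaping from 'long' to 'wide' format (dcast(df2, ..)), which counts the occurrences of each typeP value per id.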

  setkey(setDT(df2)[, c('minA', 'maxA', 'repA') := list(min(dateA),
        max(dateA), .N) , by= id], id)[
          dcast(df2, id~typeP, value.var='typeP', length)]
  #    id name dateA typeP minA maxA repA P1 P2 P3
  #1:  1    A   150    P1  150  160    2  2  0  0
  #2:  1    A   160    P1  150  160    2  2  0  0
  #3:  2    B   110    P2   98 1009    4  1  3  0
  #4:  2    B  1009    P2   98 1009    4  1  3  0
  #5:  2    B    98    P1   98 1009    4  1  3  0
  #6:  2    B   309    P2   98 1009    4  1  3  0
  #7:  3    C   218    P2  218  310    2  0  1  1
  #8:  3    C   310    P3  218  310    2  0  1  1
  #9:  4    D   219    P1  219  219    1  1  0  0
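The chained expression above does three things at once. A step-by-step sketch of the same idea may be easier to follow. It assumes data.table is installed, uses the hypothetical names `dt`, `counts`, and `res`, and joins with `on = "id"` instead of `setkey()`, which is a swapped-in variant rather than the exact code of the answer.

```r
library(data.table)

# Updated sample data (same values as df2), built directly as a data.table
dt <- data.table(
  id    = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L),
  name  = c("A", "A", "B", "B", "B", "B", "C", "C", "D"),
  dateA = c(150L, 160L, 110L, 1009L, 98L, 309L, 218L, 310L, 219L),
  typeP = c("P1", "P1", "P2", "P2", "P1", "P2", "P2", "P3", "P1")
)

# Step 1: per-id summary columns, added by reference
dt[, c("minA", "maxA", "repA") := list(min(dateA), max(dateA), .N), by = id]

# Step 2: reshape long -> wide, counting each typeP value per id
counts <- dcast(dt, id ~ typeP, value.var = "typeP", fun.aggregate = length)

# Step 3: join the per-id counts back onto every original row
res <- dt[counts, on = "id"]

print(res)
```

Keeping `counts` as a separate object also makes it easy to inspect the wide table (one row per id) before joining.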

Data

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L),
 name = c("A", 
"A", "B", "B", "B", "B", "C", "C", "D"), dateA = c(150L, 160L, 
110L, 1009L, 98L, 309L, 218L, 310L, 219L)), .Names = c("id", 
"name", "dateA"), class = "data.frame", row.names = c(NA, -9L))

df2 <- structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L), 
 name = c("A", 
"A", "B", "B", "B", "B", "C", "C", "D"), dateA = c(150L, 160L, 
110L, 1009L, 98L, 309L, 218L, 310L, 219L), typeP = c("P1", "P1", 
"P2", "P2", "P1", "P2", "P2", "P3", "P1")), .Names = c("id", 
"name", "dateA", "typeP"), class = "data.frame",
 row.names = c(NA, -9L))

Answer 1 (score: 2)

Using dplyr:

library(dplyr)
# per-id min, max, and row count, repeated on every row of the group
Data <- Data %>%
      group_by(id) %>%
      mutate(minA = min(dateA), maxA = max(dateA), repA = n())

which gives

> Data
Source: local data frame [9 x 6]
Groups: id

  id name dateA minA maxA repA
1  1    A   150  150  160    2
2  1    A   160  150  160    2
3  2    B   110   98 1009    4
4  2    B  1009   98 1009    4
5  2    B    98   98 1009    4
6  2    B   309   98 1009    4
7  3    C   218  218  310    2
8  3    C   310  218  310    2
9  4    D   219  219  219    1
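The printout shows `Groups: id`, so the result is still grouped after `mutate()`. If later steps should treat the table as a whole, one option is to end the pipeline with `ungroup()`. The sketch below assumes dplyr is installed, rebuilds the sample data itself, and stores the result under the hypothetical name `res`.

```r
library(dplyr)

# Sample data from the question
Data <- data.frame(
  id    = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L),
  name  = c("A", "A", "B", "B", "B", "B", "C", "C", "D"),
  dateA = c(150L, 160L, 110L, 1009L, 98L, 309L, 218L, 310L, 219L),
  stringsAsFactors = FALSE
)

res <- Data %>%
  group_by(id) %>%
  mutate(minA = min(dateA), maxA = max(dateA), repA = n()) %>%
  ungroup()  # drop the grouping so later verbs work on the whole table

print(res)
```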

Answer 2 (score: 1)

You can use data.table as follows:

setDT(dat)
setkey(dat, id) #this makes the last line join on id
agg_dat <- dat[,.(minA = min(dateA), maxA = max(dateA), repA = .N), by = id]
dat[agg_dat]

Here agg_dat contains the aggregated data (one row per id), and dat[agg_dat] joins the aggregated data back onto the dataset by id.
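A self-contained sketch of this aggregate-then-join approach on the question's sample data is shown below. It assumes a data.table version that supports the `on` argument, which names the join column explicitly and is used here in place of `setkey()`. The name `joined` is hypothetical.

```r
library(data.table)

# Sample data from the question
dat <- data.table(
  id    = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L),
  name  = c("A", "A", "B", "B", "B", "B", "C", "C", "D"),
  dateA = c(150L, 160L, 110L, 1009L, 98L, 309L, 218L, 310L, 219L)
)

# One row per id holding the aggregated values
agg_dat <- dat[, .(minA = min(dateA), maxA = max(dateA), repA = .N), by = id]

# Join the aggregates back onto the original rows by id
joined <- dat[agg_dat, on = "id"]

print(agg_dat)
print(joined)
```

Compared with `:=`, this leaves `dat` unchanged and gives you the per-id summary table `agg_dat` as a by-product.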