Getting nested values from a data frame

Time: 2015-04-10 17:00:33

Tags: r plyr

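I am trying to get the maximum value of the column event up to the point where an agreement (a dummy variable) is reached. Events are nested in agreements, and agreements are nested in dyad-years. Note that the years are not always continuous, meaning there are gaps between some years (1986, 1987, 2001, 2002).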

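I can get the maximum within a dyad using ddply and max(event), but I am struggling with how to "assign" the different events to the right agreement (until/after). I am basically missing an identifier that assigns each observation to an agreement.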

The output I am looking for is already shown in the result column.

dyad    year    event   agreement   agreement.name  result  
  1     1985    9           
  1     1986    4       1           agreement1       9 
  1     1987    
  1     2001    3       
  1     2002            1           agreement2       3
  2     1999    1       
  2     2000    5            
  2     2001            1           agreement3       5 
  2     2002    2       
  2     2003                
  2     2004            1           agreement 4      2

Here is the data in a format that is hopefully easier to work with:

df<-structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L, 
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L, 
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA, 
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "", 
"agreement2", "", "", "agreement3", "", "", "agreement 4"), result = c(NA, 
9L, NA, NA, 3L, NA, NA, 5L, NA, NA, 2L)), .Names = c("dyad", 
"year", "event", "agreement", "agreement.name", "result"), class = "data.frame", row.names = c(NA, 
-11L))

1 Answer:

Answer 0 (score: 1)

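Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and create another grouping variable ('ind') based on the non-empty elements of 'agreement.name'. Grouping by the 'dyad' and 'ind' columns, we use ifelse to create a new column 'result' that holds the max of 'event' in the rows where 'agreement.name' is non-empty.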

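library(data.table)
# Convert by reference and build 'ind' within each dyad: the counter steps up
# on the first blank-name row that follows an agreement row.
setDT(df)[, ind := cumsum(c(TRUE, diff(agreement.name == '') > 0)), by = dyad][,
    # Within each dyad/ind block, write the block maximum on the agreement row only.
    result := ifelse(agreement.name != '', max(event, na.rm = TRUE), NA),
    by = list(dyad, ind)][, ind := NULL][]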
#       dyad year event agreement agreement.name result
# 1:    1 1985     9        NA                    NA
# 2:    1 1986     4         1     agreement1      9
# 3:    1 1987    NA        NA                    NA
# 4:    1 2001     3        NA                    NA
# 5:    1 2002    NA         1     agreement2      3
# 6:    2 1999     1        NA                    NA
# 7:    2 2000     5        NA                    NA
# 8:    2 2001    NA         1     agreement3      5
# 9:    2 2002     2        NA                    NA
#10:    2 2003    NA        NA                    NA
#11:    2 2004    NA         1    agreement 4      2
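
To see how the grouping variable 'ind' in the code above is formed, the expression can be traced on dyad 1 alone. This is a minimal base R sketch: the vector `name` is just the agreement.name values for dyad 1 copied from the data, and the intermediate names `blank` and `starts_block` are introduced here for illustration only.

```r
# agreement.name values for dyad 1 (years 1985, 1986, 1987, 2001, 2002)
name <- c("", "agreement1", "", "", "agreement2")

blank <- name == ""                  # TRUE FALSE TRUE TRUE FALSE
# diff() on a logical vector is +1 where a named row is followed by a blank one
starts_block <- c(TRUE, diff(blank) > 0)
ind <- cumsum(starts_block)
ind
# [1] 1 1 2 2 2
```

Rows 1-2 form the block that ends with agreement1 and rows 3-5 the block that ends with agreement2, so max(event) within each block gives 9 and 3.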

Alternatively, instead of ifelse, we can use numeric indexing:

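# Same result without ifelse; 'ind' is computed inside 'by' and paired with dyad.
setDT(df)[, result := c(NA, max(event, na.rm = TRUE))[(agreement.name != '') + 1L],
   by = list(ind = cumsum(c(TRUE, diff(agreement.name == '') > 0)), dyad)][]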
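
The indexing trick in the snippet above may be easier to follow in isolation. In this small sketch (the names `block_max` and `has_name` are made up for illustration), adding 1L to a logical vector turns it into an index of 1 or 2, which picks either NA or the maximum from a two-element vector.

```r
block_max <- 9L                      # stands in for max(event, na.rm = TRUE)
has_name <- c(FALSE, TRUE)           # stands in for agreement.name != ''

# FALSE + 1L is 1 (selects NA); TRUE + 1L is 2 (selects the maximum)
res <- c(NA, block_max)[has_name + 1L]
res
# [1] NA  9
```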

Data

df <- structure(list(dyad = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
2L), year = c(1985L, 1986L, 1987L, 2001L, 2002L, 1999L, 2000L, 
2001L, 2002L, 2003L, 2004L), event = c(9L, 4L, NA, 3L, NA, 1L, 
5L, NA, 2L, NA, NA), agreement = c(NA, 1L, NA, NA, 1L, NA, NA, 
1L, NA, NA, 1L), agreement.name = c("", "agreement1", "", "", 
"agreement2", "", "", "agreement3", "", "", "agreement 4")), 
.Names = c("dyad", 
"year", "event", "agreement", "agreement.name"), row.names = c(NA,
-11L), class = "data.frame")
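
If data.table is not available, the same idea can probably be sketched in base R with ave(). This is not part of the original answer. It assumes the data look like the example (every block contains at least one non-missing event), and the helper names `has_name`, `ind`, `key`, and `block_max` are illustrative.

```r
# Compact copy of the example data (only the columns needed here)
df <- data.frame(
  dyad  = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  event = c(9L, 4L, NA, 3L, NA, 1L, 5L, NA, 2L, NA, NA),
  agreement.name = c("", "agreement1", "", "", "agreement2", "", "",
                     "agreement3", "", "", "agreement 4"),
  stringsAsFactors = FALSE
)

has_name <- df$agreement.name != ""

# Block counter within each dyad, using the same cumsum/diff expression
ind <- ave(as.integer(has_name), df$dyad,
           FUN = function(a) cumsum(c(TRUE, diff(a == 0L) > 0)))

# Maximum of event within each dyad/block combination
key <- paste(df$dyad, ind)
block_max <- ave(df$event, key, FUN = function(e) max(e, na.rm = TRUE))

# Keep the maximum only on the agreement rows
df$result <- block_max
df$result[!has_name] <- NA
df$result
# [1] NA  9 NA NA  3 NA NA  5 NA NA  2
```

Pasting dyad and ind into a single key keeps ave() from evaluating dyad/block combinations that do not occur in the data.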