R: Modifying a data frame and improving code performance

Time: 2014-09-04 22:51:02

Tags: r performance dataframe

My sample data in an R data frame looks like this.

   NAME  ID SURVEY_YEAR REFERENCE_YEAR SUM SUM_REFYEAR
1 NAME1  47        1960           1959  -6           0
2 NAME1  47        1961           1960 -10          -6
3 NAME1  47        1965           1963 -23         -10
4 NAME2 259        2007           2004  -9           0
5 NAME2 259        2010           2009  NA           0
6 NAME2 259        2014           2011 -40          -9
7 NAME3 765        1888           1885   5           0
8 NAME3 765        1889           1888  12           5
9 NAME3 765        1890           1889  22          12

I am using the code below to modify the data, which produces this data frame.

    NAME  ID SURVEY_YEAR REFERENCE_YEAR SUM SUM_REFYEAR
1  NAME1  47        1960           1959  -6           0
2  NAME1  47        1961           1960 -10          -6
3  NAME1  47        1963           1961  NA          NA
4  NAME1  47        1965           1963 -23         -10
5  NAME2 259        2007           2004  -9           0
6  NAME2 259        2009           2007  NA          NA
7  NAME2 259        2010           2009  NA           0
8  NAME2 259        2011           2010  NA          NA
9  NAME2 259        2014           2011 -40          -9
10 NAME3 765        1888           1885   5           0
11 NAME3 765        1889           1888  12           5
12 NAME3 765        1890           1889  22          12

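The code itself does what I want it to do (it inserts rows with NA values to fill the gaps between REFERENCE_YEAR and SURVEY_YEAR). However, it takes a very long time on larger data sets, and I was wondering whether anyone knows how to optimize this step so that I get better performance?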

Here is my code:

# read in data
data <- data.frame(NAME=c("NAME1", "NAME1", "NAME1","NAME2","NAME2","NAME2","NAME3","NAME3","NAME3" ),
                   ID=c(47,47,47,259,259,259,765,765,765),
                   SURVEY_YEAR=c(1960,1961,1965,2007,2010,2014,1888,1889,1890), 
                   REFERENCE_YEAR=c(1959,1960,1963,2004,2009,2011,1885,1888,1889),
                   SUM=c(-6,-10,-23,-9,NA,-40,5,12,22),
                   SUM_REFYEAR=c(0,-6,-10,0,0,-9,0,5,12))

# NA Fill between REFERENCE_YEAR and SURVEY_YEAR
i <- 1
while (i<=length(data$SUM)-1) {
  if (data$ID[i+1]==data$ID[i]) {   
    # Check if row needs to be added
    ref <- data$REFERENCE_YEAR[i+1]
    surv <- data$SURVEY_YEAR[i]
    if (ref-surv >= 1) { 
      # Add row
      data[seq(i+2,nrow(data)+1),] <- data[seq(i+1,nrow(data)),]
      data[i+1,1:2] <- data[i,1:2]
      data[i+1,3:6] <- c(ref ,surv , NA, NA)
    }
  }
  i <- i+1
}

Thanks for your help!

1 Answer:

Answer 0: (score: 5)

Use data.table and think of the problem as a merge.

library(data.table)
# coerce `data` to a `data.table`
setDT(data)
# get list of all survey and reference years you wish to create
all_years <- data[,{
         ay <- sort(unique(c(SURVEY_YEAR, REFERENCE_YEAR)))
         list(SURVEY_YEAR= tail(ay, -1), REFERENCE_YEAR = head(ay, -1))
       },by=list(NAME, ID)]

# set keys for merging
setkey(data, NAME,ID, SURVEY_YEAR, REFERENCE_YEAR)
setkey(all_years, NAME,ID, SURVEY_YEAR, REFERENCE_YEAR)
# merge to create your required data set
data[all_years]


#      NAME  ID SURVEY_YEAR REFERENCE_YEAR SUM SUM_REFYEAR
#  1: NAME1  47        1960           1959  -6           0
#  2: NAME1  47        1961           1960 -10          -6
#  3: NAME1  47        1963           1961  NA          NA
#  4: NAME1  47        1965           1963 -23         -10
#  5: NAME2 259        2007           2004  -9           0
#  6: NAME2 259        2009           2007  NA          NA
#  7: NAME2 259        2010           2009  NA           0
#  8: NAME2 259        2011           2010  NA          NA
#  9: NAME2 259        2014           2011 -40          -9
# 10: NAME3 765        1888           1885   5           0
# 11: NAME3 765        1889           1888  12           5
# 12: NAME3 765        1890           1889  22          12
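
If adding a dependency is not an option, the same merge idea can be sketched in base R. This is only a rough sketch, not part of the answer above: the helper `year_pairs` and the result name `filled` are made up here for illustration, base `merge()` is swapped in for the keyed data.table join, and it will probably be slower than data.table on large inputs. Like the answer, it assumes that the sorted, unique years within each ID form the consecutive (SURVEY_YEAR, REFERENCE_YEAR) pairs you want.

```r
# Sample data from the question
data <- data.frame(
  NAME = c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2",
           "NAME3", "NAME3", "NAME3"),
  ID = c(47, 47, 47, 259, 259, 259, 765, 765, 765),
  SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014, 1888, 1889, 1890),
  REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011, 1885, 1888, 1889),
  SUM = c(-6, -10, -23, -9, NA, -40, 5, 12, 22),
  SUM_REFYEAR = c(0, -6, -10, 0, 0, -9, 0, 5, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one ID, pair each year with the one before it
year_pairs <- function(d) {
  ay <- sort(unique(c(d$SURVEY_YEAR, d$REFERENCE_YEAR)))
  data.frame(NAME = d$NAME[1], ID = d$ID[1],
             SURVEY_YEAR = ay[-1], REFERENCE_YEAR = ay[-length(ay)],
             stringsAsFactors = FALSE)
}

# All (SURVEY_YEAR, REFERENCE_YEAR) pairs that should exist, per ID
all_years <- do.call(rbind, lapply(split(data, data$ID), year_pairs))

# Keep every row of all_years; pairs missing from `data` get NA in SUM columns
filled <- merge(data, all_years,
                by = c("NAME", "ID", "SURVEY_YEAR", "REFERENCE_YEAR"),
                all.y = TRUE)

# Restore the ordering used in the question
filled <- filled[order(filled$ID, filled$SURVEY_YEAR), ]
rownames(filled) <- NULL
filled
```

The `all.y = TRUE` argument is what makes this a fill step: rows that exist only in `all_years` are kept, and their `SUM` and `SUM_REFYEAR` come out as NA.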