My sample data, stored in a data frame in R, looks like this:
   NAME  ID SURVEY_YEAR REFERENCE_YEAR SUM SUM_REFYEAR
1 NAME1  47        1960           1959  -6           0
2 NAME1  47        1961           1960 -10          -6
3 NAME1  47        1965           1963 -23         -10
4 NAME2 259        2007           2004  -9           0
5 NAME2 259        2010           2009  NA           0
6 NAME2 259        2014           2011 -40          -9
7 NAME3 765        1888           1885   5           0
8 NAME3 765        1889           1888  12           5
9 NAME3 765        1890           1889  22          12
I am using the code below to modify the data so that it produces this data frame:
    NAME  ID SURVEY_YEAR REFERENCE_YEAR SUM SUM_REFYEAR
1  NAME1  47        1960           1959  -6           0
2  NAME1  47        1961           1960 -10          -6
3  NAME1  47        1963           1961  NA          NA
4  NAME1  47        1965           1963 -23         -10
5  NAME2 259        2007           2004  -9           0
6  NAME2 259        2009           2007  NA          NA
7  NAME2 259        2010           2009  NA           0
8  NAME2 259        2011           2010  NA          NA
9  NAME2 259        2014           2011 -40          -9
10 NAME3 765        1888           1885   5           0
11 NAME3 765        1889           1888  12           5
12 NAME3 765        1890           1889  22          12
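The code itself does what I want: wherever there is a gap between one row's SURVEY_YEAR and the next row's REFERENCE_YEAR within the same ID, it inserts a row filled with NA values. However, it takes a very long time on larger data sets. Does anyone know how to optimize this step so that it runs faster?
Here is my code: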
# read in data
data <- data.frame(NAME = c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2", "NAME3", "NAME3", "NAME3"),
                   ID = c(47, 47, 47, 259, 259, 259, 765, 765, 765),
                   SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014, 1888, 1889, 1890),
                   REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011, 1885, 1888, 1889),
                   SUM = c(-6, -10, -23, -9, NA, -40, 5, 12, 22),
                   SUM_REFYEAR = c(0, -6, -10, 0, 0, -9, 0, 5, 12))
# NA Fill between REFERENCE_YEAR and SURVEY_YEAR
i <- 1
while (i <= length(data$SUM) - 1) {
  if (data$ID[i + 1] == data$ID[i]) {
    # Check if row needs to be added
    ref <- data$REFERENCE_YEAR[i + 1]
    surv <- data$SURVEY_YEAR[i]
    if (ref - surv >= 1) {
      # Add row
      data[seq(i + 2, nrow(data) + 1), ] <- data[seq(i + 1, nrow(data)), ]
      data[i + 1, 1:2] <- data[i, 1:2]
      data[i + 1, 3:6] <- c(ref, surv, NA, NA)
    }
  }
  i <- i + 1
}
Thanks for your help!
Answer 0 (score: 5)
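Use data.table and think of the problem as a merge: build every (SURVEY_YEAR, REFERENCE_YEAR) pair you want per group, then join the original data onto that list.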
library(data.table)
# coerce `data` to a `data.table`
setDT(data)
# get list of all survey and reference years you wish to create
all_years <- data[, {
  ay <- sort(unique(c(SURVEY_YEAR, REFERENCE_YEAR)))
  list(SURVEY_YEAR = tail(ay, -1), REFERENCE_YEAR = head(ay, -1))
}, by = list(NAME, ID)]
# set keys for merging
setkey(data, NAME,ID, SURVEY_YEAR, REFERENCE_YEAR)
setkey(all_years, NAME,ID, SURVEY_YEAR, REFERENCE_YEAR)
# merge to create your required data set
data[all_years]
#      NAME  ID SURVEY_YEAR REFERENCE_YEAR SUM SUM_REFYEAR
#  1: NAME1  47        1960           1959  -6           0
#  2: NAME1  47        1961           1960 -10          -6
#  3: NAME1  47        1963           1961  NA          NA
#  4: NAME1  47        1965           1963 -23         -10
#  5: NAME2 259        2007           2004  -9           0
#  6: NAME2 259        2009           2007  NA          NA
#  7: NAME2 259        2010           2009  NA           0
#  8: NAME2 259        2011           2010  NA          NA
#  9: NAME2 259        2014           2011 -40          -9
# 10: NAME3 765        1888           1885   5           0
# 11: NAME3 765        1889           1888  12           5
# 12: NAME3 765        1890           1889  22          12
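If adding a dependency is not an option, the same "build the year pairs, then join" idea can be sketched in base R with merge(). This is only an illustrative variant, not part of the original answer: the helper name year_pairs and the result name filled are made up here, and it has not been benchmarked against the data.table version, so treat any speed-up over the while loop as something to measure on your own data.

```r
# Sample data from the question
data <- data.frame(
  NAME = rep(c("NAME1", "NAME2", "NAME3"), each = 3),
  ID = rep(c(47, 259, 765), each = 3),
  SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014, 1888, 1889, 1890),
  REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011, 1885, 1888, 1889),
  SUM = c(-6, -10, -23, -9, NA, -40, 5, 12, 22),
  SUM_REFYEAR = c(0, -6, -10, 0, 0, -9, 0, 5, 12)
)

# Hypothetical helper: for one ID, pair each distinct year with the next one,
# giving consecutive (REFERENCE_YEAR, SURVEY_YEAR) intervals
year_pairs <- function(d) {
  ay <- sort(unique(c(d$SURVEY_YEAR, d$REFERENCE_YEAR)))
  data.frame(NAME = d$NAME[1], ID = d$ID[1],
             SURVEY_YEAR = ay[-1], REFERENCE_YEAR = ay[-length(ay)])
}

# Build the full list of intervals, one group at a time
all_years <- do.call(rbind, lapply(split(data, data$ID), year_pairs))

# Left join: keep every interval, fill SUM / SUM_REFYEAR with NA where
# the original data has no matching row
filled <- merge(all_years, data,
                by = c("NAME", "ID", "SURVEY_YEAR", "REFERENCE_YEAR"),
                all.x = TRUE)
filled <- filled[order(filled$NAME, filled$SURVEY_YEAR), ]
rownames(filled) <- NULL
```

On the sample data, filled should have the same 12 rows as the data.table result above. As in the data.table version, the intervals are derived from the sorted distinct years per group, which assumes that the original rows do not overlap one another within a group.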