I have the following data frame in R (sample data only):
data <- data.frame(NAME=c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2"),
                   ID=c(47, 47, 47, 259, 259, 259),
                   SURVEY_YEAR=c(1960, 1961, 1965, 2007, 2010, 2014),
                   REFERENCE_YEAR=c(1959, 1960, 1963, 2004, 2009, 2011),
                   CUMULATIVE_SUM=c(-6, -10, -23, -9, NA, -40))
Printed as a table, it looks like this:
NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM
1 NAME1 47 1960 1959 -6
2 NAME1 47 1961 1960 -10
3 NAME1 47 1965 1963 -23
4 NAME2 259 2007 2004 -9
5 NAME2 259 2010 2009 NA
6 NAME2 259 2014 2011 -40
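What I want is to restructure the data frame so that it ends up looking like this, where CUMULATIVE_SUM_REFYEAR holds the previous non-NA CUMULATIVE_SUM within the same NAME (0 for the first row of each NAME, NA where CUMULATIVE_SUM itself is NA):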
NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
1 NAME1 47 1960 1959 -6 0
2 NAME1 47 1961 1960 -10 -6
3 NAME1 47 1965 1963 -23 -10
4 NAME2 259 2007 2004 -9 0
5 NAME2 259 2010 2009 NA NA
6 NAME2 259 2014 2011 -40 -9
I am trying to achieve this with the following code:
# Loop through the elements of data$CUMULATIVE_SUM
for (i in 1:length(data$CUMULATIVE_SUM)) {
  # Take the value from the row above, or 0 if this is the first row
  # or the row above belongs to a different NAME
  if (i==1) {
    value=0 # First row
  } else {
    if (data$NAME[i-1]==data$NAME[i]) {
      value=data$CUMULATIVE_SUM[i-1] # Normal case: take the value from the row above
    } else {
      value=0 # Different NAME
    }
  }
  data$CUMULATIVE_SUM_REFYEAR[i] <- value # Write the value into the new column
}
With this code, the result looks like this:
NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
1 NAME1 47 1960 1959 -6 0
2 NAME1 47 1961 1960 -10 -6
3 NAME1 47 1965 1963 -23 -10
4 NAME2 259 2007 2004 -9 0
5 NAME2 259 2010 2009 NA **-9**
6 NAME2 259 2014 2011 -40 NA
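As you can see by comparing this with my desired output, the value -9 ends up in the wrong row (marked in bold): the loop always copies the value from the row directly above, so the NA row receives -9 and the row after it receives NA. Is there a way to handle rows where CUMULATIVE_SUM is NA? I am stuck. Thanks for your help!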
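Answer 0 (score: 2)
Try data.table: restrict the assignment to the rows where CUMULATIVE_SUM is not NA, then shift the values down by one within each NAME: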
library(data.table)
setDT(data)[!is.na(CUMULATIVE_SUM),
CUMULATIVE_SUM_REFYEAR := c(0, CUMULATIVE_SUM[-.N]),
by = NAME]
data
# NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
# 1: NAME1 47 1960 1959 -6 0
# 2: NAME1 47 1961 1960 -10 -6
# 3: NAME1 47 1965 1963 -23 -10
# 4: NAME2 259 2007 2004 -9 0
# 5: NAME2 259 2010 2009 NA NA
# 6: NAME2 259 2014 2011 -40 -9
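The lag in the code above comes from c(0, CUMULATIVE_SUM[-.N]): .N is the number of rows in the current group, so dropping the last element and prepending 0 shifts every value down by one position. A minimal base R sketch of the same shift on a plain vector may make this easier to see; x is just an illustrative name holding the non-NA NAME2 values from the sample data:

```r
# Non-NA CUMULATIVE_SUM values of NAME2 from the sample data
x <- c(-9, -40)

# Drop the last element and prepend 0: each position now holds the
# previous value, and the first position holds 0
lagged <- c(0, x[-length(x)])
print(lagged)
```

Because the NA row is excluded by the !is.na(CUMULATIVE_SUM) filter before the shift, it is never assigned and stays NA in the new column.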
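Answer 1 (score: 1)
Using dplyr: compute the lag on the non-NA rows within each NAME, then join the result back onto the full data so that the NA row is kept: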
library(dplyr)
left_join(data, data %>%
group_by(NAME) %>%
filter(!is.na(CUMULATIVE_SUM)) %>%
mutate(CUMULATIVE_SUM_REFYEAR= lag(CUMULATIVE_SUM, 1, 0)))
# NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
#1 NAME1 47 1960 1959 -6 0
#2 NAME1 47 1961 1960 -10 -6
#3 NAME1 47 1965 1963 -23 -10
#4 NAME2 259 2007 2004 -9 0
#5 NAME2 259 2010 2009 NA NA
#6 NAME2 259 2014 2011 -40 -9
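For comparison, the same idea (skip the NA rows, lag within each NAME, write the result back) can probably also be done in base R without extra packages. This is only a sketch on the sample data from the question; the helper vector ok is a name introduced here for illustration:

```r
data <- data.frame(NAME = c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2"),
                   ID = c(47, 47, 47, 259, 259, 259),
                   SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014),
                   REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011),
                   CUMULATIVE_SUM = c(-6, -10, -23, -9, NA, -40))

# Rows that actually carry a value; NA rows are skipped and stay NA
ok <- !is.na(data$CUMULATIVE_SUM)

# Start with NA everywhere, then fill only the non-NA rows
data$CUMULATIVE_SUM_REFYEAR <- NA_real_
data$CUMULATIVE_SUM_REFYEAR[ok] <- ave(
  data$CUMULATIVE_SUM[ok],
  data$NAME[ok],
  # Within each NAME: prepend 0 and drop the last value (lag by one)
  FUN = function(x) c(0, x[-length(x)])
)

print(data)
```

Like the two approaches above, this assumes the rows are already sorted in the intended order within each NAME.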