R: Organizing, structuring, and subsetting a data frame

Date: 2014-09-02 21:56:57

Tags: r dataframe

I have the following data frame in R (sample data only):

data <- data.frame(NAME=c("NAME1", "NAME1", "NAME1","NAME2","NAME2","NAME2"),
                   ID=c(47,47,47,259,259,259),
                   SURVEY_YEAR=c(1960,1961,1965,2007,2010,2014), 
                   REFERENCE_YEAR=c(1959,1960,1963,2004,2009,2011),
                   CUMULATIVE_SUM=c(-6,-10,-23,-9,NA,-40))

Printed as a table, it looks like this:

  NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM
1 NAME1  47        1960           1959             -6
2 NAME1  47        1961           1960            -10
3 NAME1  47        1965           1963            -23
4 NAME2 259        2007           2004             -9
5 NAME2 259        2010           2009             NA
6 NAME2 259        2014           2011            -40

What I want to do is restructure my data frame so that it ends up looking like this:

   NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
1 NAME1  47        1960           1959             -6                      0
2 NAME1  47        1961           1960            -10                     -6
3 NAME1  47        1965           1963            -23                    -10
4 NAME2 259        2007           2004             -9                      0
5 NAME2 259        2010           2009             NA                     NA
6 NAME2 259        2014           2011            -40                     -9

I am trying to achieve this with the following code:

# loop through elements in data$CUMULATIVE_SUM
for (i in 1:length(data$CUMULATIVE_SUM)) { 
  # take the value of the previous row, but take 0 if the previous row has another NAME or this is the first row
  if (i==1) {
    value=0 # If first row
  } else { 
    if (data$NAME[i-1]==data$NAME[i]) {
      value=data$CUMULATIVE_SUM[i-1] # Normal case: take upper value
    } else {
      value=0 # If other NAME
    }
  }
  data$CUMULATIVE_SUM_REFYEAR[i] <- value # Write new value in new column
}

With this code, the result looks like this:

   NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
1 NAME1  47        1960           1959             -6                      0
2 NAME1  47        1961           1960            -10                     -6
3 NAME1  47        1965           1963            -23                    -10
4 NAME2 259        2007           2004             -9                      0
5 NAME2 259        2010           2009             NA                     **-9**
6 NAME2 259        2014           2011            -40                     NA

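As you may have noticed when comparing this with my desired result, the value -9 is in the wrong place (marked in bold). Is there a way to handle this when a row contains an NA value? I am stuck. Thanks for your help!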

2 answers:

Answer 0 (score: 2)

Try

library(data.table)
setDT(data)[!is.na(CUMULATIVE_SUM), 
            CUMULATIVE_SUM_REFYEAR := c(0, CUMULATIVE_SUM[-.N]),
            by = NAME]
data
#     NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
# 1: NAME1  47        1960           1959             -6                      0
# 2: NAME1  47        1961           1960            -10                     -6
# 3: NAME1  47        1965           1963            -23                    -10
# 4: NAME2 259        2007           2004             -9                      0
# 5: NAME2 259        2010           2009             NA                     NA
# 6: NAME2 259        2014           2011            -40                     -9
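
The idea in this answer is to skip the NA rows and, within each NAME, shift the remaining values down by one with a leading 0. As a rough sketch of the same idea without data.table, the following uses base R's `ave()`. It assumes the rows are already ordered within each NAME, as in the sample data, and it is only one possible way to write this.

```r
# Sample data from the question
data <- data.frame(NAME = c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2"),
                   ID = c(47, 47, 47, 259, 259, 259),
                   SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014),
                   REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011),
                   CUMULATIVE_SUM = c(-6, -10, -23, -9, NA, -40))

# Rows that take part in the shift (non-NA CUMULATIVE_SUM)
ok <- !is.na(data$CUMULATIVE_SUM)

# Start with NA everywhere so that skipped rows stay NA
data$CUMULATIVE_SUM_REFYEAR <- NA_real_

# Within each NAME: 0 first, then every value except the last
data$CUMULATIVE_SUM_REFYEAR[ok] <- ave(
  data$CUMULATIVE_SUM[ok],
  as.character(data$NAME[ok]),
  FUN = function(x) c(0, head(x, -1))
)

data
```

For the sample data, row 5 keeps NA and row 6 picks up -9 from row 4, which matches the desired table in the question.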

Answer 1 (score: 1)

Using dplyr

library(dplyr)
# Lag the non-NA values within each NAME, then join back onto the full data
# (no `by` given, so the join uses all columns the two tables share)
left_join(data,
          data %>%
            group_by(NAME) %>%
            filter(!is.na(CUMULATIVE_SUM)) %>%
            mutate(CUMULATIVE_SUM_REFYEAR = lag(CUMULATIVE_SUM, n = 1, default = 0)))
#    NAME  ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR
# 1 NAME1  47        1960           1959             -6                      0
# 2 NAME1  47        1961           1960            -10                     -6
# 3 NAME1  47        1965           1963            -23                    -10
# 4 NAME2 259        2007           2004             -9                      0
# 5 NAME2 259        2010           2009             NA                     NA
# 6 NAME2 259        2014           2011            -40                     -9
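
Both answers do the same thing: leave the NA rows out, then take the previous remaining value within each NAME. For comparison, here is a minimal sketch of the loop from the question adjusted along those lines. The helper variables `last_name` and `last_value` are made up for this illustration, and the sketch assumes that rows with the same NAME are adjacent, as the original loop does.

```r
# Sample data from the question
data <- data.frame(NAME = c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2"),
                   ID = c(47, 47, 47, 259, 259, 259),
                   SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014),
                   REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011),
                   CUMULATIVE_SUM = c(-6, -10, -23, -9, NA, -40))

data$CUMULATIVE_SUM_REFYEAR <- NA_real_  # rows with NA input stay NA
last_name  <- NULL  # NAME of the group currently being processed
last_value <- 0     # most recent non-NA CUMULATIVE_SUM seen in that group

for (i in seq_len(nrow(data))) {
  name_i <- as.character(data$NAME[i])
  # A new NAME starts a new group, so the carried value resets to 0
  if (is.null(last_name) || name_i != last_name) {
    last_name  <- name_i
    last_value <- 0
  }
  # Only non-NA rows receive a value and update the carried value
  if (!is.na(data$CUMULATIVE_SUM[i])) {
    data$CUMULATIVE_SUM_REFYEAR[i] <- last_value
    last_value <- data$CUMULATIVE_SUM[i]
  }
}

data
```

The difference from the original loop is that the value is carried in a variable instead of being read from row `i-1`, so an NA row neither receives a value nor passes its NA on to the next row.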