R: Identifying missing rows from a pre-specified sequence

Time: 2015-12-07 06:10:56

Tags: r

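I am a fairly recent convert to R, and I have a large longitudinal dataset from an ongoing study. The data are organized with one row per subject per visit. The number of possible visits depends on when each subject enrolled. I would like to identify which visits are missing, or more precisely where the expected visit sequence is broken. A sample of the data might look like this: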

  id  visit
1001     BL
1001 Week12
1001 Week24
1001 Week36
1002     BL
1002 Week12
1002 Week36
1002 Week48
1002 Week60
1002 Week72
1003     BL
1003 Week12
1003 Week24

Ideally, the output I am looking for would look like this:

id   visit_missing
1002        Week24

1 Answer:

Answer 0: (score: 0)

OK, this is a pretty horrible way to do it, but it is 3 a.m. and this is the best I can come up with right now.

## Creating the dataset
id <- c(rep(1001, 5), rep(1002, 6), rep(1003, 3))
visit <- c('BL', 'Week12', 'Week24', 'Week36', 'Week72', 'BL', 'Week12', 'Week36', 'Week48', 'Week60', 'Week72', 'BL', 'Week12', 'Week24')
data <- data.frame(id, visit)
data
## What the data looks like after the cleaning step below (visit reduced to a week number, BL rows dropped)
     id visit
2  1001    12
3  1001    24
4  1001    36
5  1001    72
7  1002    12
8  1002    36
9  1002    48
10 1002    60
11 1002    72
13 1003    12
14 1003    24

## Removing all the text from the visit column and retaining only numbers ('BL' becomes an empty string, then NA, so the baseline rows are dropped)
data$visit <- gsub('[^0-9]', '', data$visit)
data$visit <- as.numeric(data$visit)
data <- data[!is.na(data$visit), ]

missing <- data.frame()

## For each pair of consecutive rows with the same id, checks whether the visits are more than 12 weeks apart and, if so, adds every skipped 12-week visit to the missing data frame (assumes rows are sorted by id and visit)
for (i in 2:nrow(data)) {
  if ((data$id[i] == data$id[i - 1]) && (data$visit[i] - data$visit[i - 1] > 12)) {
    x <- data$visit[i] - data$visit[i - 1]
    while (x > 12) {
      x <- x - 12
      ## Name the columns explicitly so the result is readable
      missing <- rbind(missing, data.frame(id = data$id[i], visit_missing = data$visit[i] - x))
    }
  }
}

missing
## Output
    id visit_missing
1 1001            48
2 1001            60
3 1002            24
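
The same gaps can also be found without an explicit loop by comparing each subject's observed visits against the full expected schedule. The following is only a sketch under two assumptions that are not stated in the question: BL counts as week 0, and visits are scheduled every 12 weeks. The names `step`, `week`, and `find_gaps` are made up for illustration.

```r
## Same example data as in the answer above
data <- data.frame(
  id = c(rep(1001, 5), rep(1002, 6), rep(1003, 3)),
  visit = c("BL", "Week12", "Week24", "Week36", "Week72",
            "BL", "Week12", "Week36", "Week48", "Week60", "Week72",
            "BL", "Week12", "Week24"),
  stringsAsFactors = FALSE
)

## Assumption: visits are 12 weeks apart and BL is week 0
step <- 12
week <- as.numeric(sub("^Week", "", sub("^BL$", "Week0", data$visit)))

## Expected schedule for one subject runs from week 0 to that subject's
## last observed visit; whatever is not observed is reported as missing
find_gaps <- function(w) setdiff(seq(0, max(w), by = step), w)
gaps <- lapply(split(week, data$id), find_gaps)

## Flatten the per-subject list into one data frame
missing <- data.frame(
  id = rep(as.numeric(names(gaps)), lengths(gaps)),
  visit_missing = unlist(gaps, use.names = FALSE)
)

## Convert week numbers back to the original visit labels
missing$visit_missing <- ifelse(missing$visit_missing == 0, "BL",
                                paste0("Week", missing$visit_missing))
missing
## Rows: (1001, Week48), (1001, Week60), (1002, Week24)
```

Because BL is kept as week 0 rather than dropped, a gap directly after baseline (for example BL followed by Week24) would also be reported. Visits missing after a subject's last observed visit are not detected, since the enrollment date that determines the expected final visit is not part of the example data.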