Detect and correct a grouping variable using time sequences that repeat multiple times

Time: 2015-07-27 16:12:27

Tags: r

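My database contains incorrect personal identifiers (id). I would like an automatic way to detect and correct them, but I cannot figure one out.

So far I have only found a manual approach, which is very tedious.

The data look like this: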

   id time
1   1    1
2   1    2
3   1    3
4   2    1
5   2    2
6   2    3
7   2    1
8   2    2
9   2    3
10  3    1
11  3    2
12  3    3
13  3    1
14  3    2
15  3    3
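Judging from the information in the id variable, ids 2 and 3 are incorrect: each covers two separate time sequences. Every time the time sequence starts over, the id should change.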
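The rule above can be sketched as a detection step. This is only an illustration, and it assumes the rows are still in their original order and that time strictly increases within each true individual; the names `restart` and `bad_ids` are made up for this example.

```r
# Example data from the question
dta <- data.frame(
  id   = rep(c(1L, 2L, 3L), times = c(3L, 6L, 6L)),
  time = rep(1:3, times = 5)
)

# TRUE where time fails to increase while the id stays the same,
# i.e. a new sequence begins under an id that is already in use
restart <- c(FALSE, diff(dta$time) <= 0 & diff(dta$id) == 0)

# ids that contain at least one such restart
bad_ids <- unique(dta$id[restart])
bad_ids
```

On the example data this flags rows 7 and 13, so the ids reported are 2 and 3.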

I created a row-count variable and a flag (corrected) id variable.

dta$row = 1:nrow(dta)
dta$id_f = dta$id

Then I corrected the cases manually:

dta[4:6, 'id_f'] <- paste( dta[4:6, 'id_f'], 'a', sep = '')
dta[7:9, 'id_f'] <- paste( dta[7:9, 'id_f'], 'b', sep = '')

dta[10:12, 'id_f'] <- paste( dta[10:12, 'id_f'], 'a', sep = '')
dta[13:15, 'id_f'] <- paste( dta[13:15, 'id_f'], 'b', sep = '')

Do you have any idea how to avoid doing this manually?

The output I need, with the corrected id, looks like this:

   id time row id_f
1   1    1   1    1
2   1    2   2    1
3   1    3   3    1
4   2    1   4   2a
5   2    2   5   2a
6   2    3   6   2a
7   2    1   7   2b
8   2    2   8   2b
9   2    3   9   2b
10  3    1  10   3a
11  3    2  11   3a
12  3    3  12   3a
13  3    1  13   3b
14  3    2  14   3b
15  3    3  15   3b

Data

dta = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L), time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("id", "time"), class = "data.frame", row.names = c(NA, 
-15L))

3 answers:

Answer 0: (score: 3)

Here is one possibility:

do.call(rbind, 
        by(dta, dta$id, function(x){

          # identify ids whose time sequence starts more than once;
          # the criterion used here is "time == 1 occurs more than once"
          if(sum(x$time == 1) > 1){

            # diff: to detect non-consecutive times, i.e. differences not equal to one.
            # cumsum: to create an index variable, used to index letters
            x$id2 <- paste0(x$id, letters[cumsum(c(FALSE, diff(x$time) != 1)) + 1])

          # for id with a correct sequence of "time", just use the original id
          } else {
            x$id2 <- x$id
          }
          x
        }))

#      id time id2
# 1.1   1    1   1
# 1.2   1    2   1
# 1.3   1    3   1
# 2.4   2    1  2a
# 2.5   2    2  2a
# 2.6   2    3  2a
# 2.7   2    1  2b
# 2.8   2    2  2b
# 2.9   2    3  2b
# 3.10  3    1  3a
# 3.11  3    2  3a
# 3.12  3    3  3a
# 3.13  3    1  3b
# 3.14  3    2  3b
# 3.15  3    3  3b
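The `diff` / `cumsum` idiom in the comments above may be easier to follow on a single vector. The snippet below is a small illustration on made-up values for one id, not part of the original answer; `breaks` and `run` are names chosen here.

```r
# time values for one id that actually holds two sequences
time <- c(1L, 2L, 3L, 1L, 2L, 3L)

# TRUE wherever the step from the previous value is not exactly +1
breaks <- c(FALSE, diff(time) != 1)

# running count of breaks, shifted to start at 1, gives a sequence index
run <- cumsum(breaks) + 1

# the index picks the letter suffix
letters[run]
# "a" "a" "a" "b" "b" "b"
```

Because the suffix comes from `letters`, this approach is limited to 26 sequences per id.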

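Answer 1: (score: 1)

This is not exactly what you asked for, but it works if you can tolerate a 1a without a 1b. It does, however, require the data to be properly ordered before you run it.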

library(dplyr)
dta %>%
  mutate(time_diff = c(-1, diff(time)),
         new_time = (time_diff < 0),
         time_id = cumsum(new_time),
         row = 1:n()) %>%
  group_by(id) %>%
  mutate(time_id = time_id - (min(time_id) - 1),
         time_id = letters[time_id],
         id_f = paste0(id, time_id)) %>%
  ungroup() %>%
  select(id, time, row, id_f) 
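If the lone 1a is not acceptable, one possible follow-up is to drop the suffix for ids that received only a single label. The base R sketch below is an assumption about how that could be done, not part of the answer; it starts from `id` and `id_f` vectors holding the values the pipeline above yields for the example data, and `n_labels` and `single` are names chosen here.

```r
# id and id_f as produced by the pipeline for the example data
id   <- rep(c(1L, 2L, 3L), times = c(3L, 6L, 6L))
id_f <- c(rep("1a", 3), rep("2a", 3), rep("2b", 3),
          rep("3a", 3), rep("3b", 3))

# number of distinct labels assigned to each id
n_labels <- tapply(id_f, id, function(x) length(unique(x)))

# rows whose id has exactly one label keep the plain id
single <- as.vector(n_labels[as.character(id)] == 1)
id_f[single] <- as.character(id[single])
id_f
```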

Answer 2: (score: 1)

I named the data frame z.

z$timediff <- c(0,diff(z$time)) < 0
z$iddiff <- c(0,diff(z$id))
z$timediffminusiddiff <- z$timediff - z$iddiff
z$cumsumtimediff <- cumsum(z$timediff)

z$haserr <- ave(z$timediffminusiddiff,z$id,FUN = max)
z$newnum <- letters[z$cumsumtimediff - ave(z$cumsumtimediff,z$id,FUN = min) + 1]
z[z$haserr == 1,'id'] <- paste0(z$id,z$newnum)[z$haserr == 1]
z[ ,c('id','time')]

You could squeeze this together into fewer lines, but then it's harder to read.
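For comparison, a condensed version might look like the sketch below. It follows the same logic as the code in this answer but is a rewrite, with `new_seq`, `run`, `suffix` and `has_err` as names chosen here; it assumes the same row order and the example data from the question.

```r
z <- data.frame(
  id   = rep(c(1L, 2L, 3L), times = c(3L, 6L, 6L)),
  time = rep(1:3, times = 5)
)

new_seq <- c(FALSE, diff(z$time) < 0)   # a time sequence restarts on this row
run     <- cumsum(new_seq)              # global sequence counter

# letter index: sequence number counted from the first sequence of each id
suffix  <- letters[run - ave(run, z$id, FUN = min) + 1]

# an id is wrong if some restart is not accompanied by an id change
has_err <- ave(new_seq - c(0, diff(z$id)), z$id, FUN = max) == 1

z$id <- ifelse(has_err, paste0(z$id, suffix), as.character(z$id))
z
```

Note that `id` becomes a character column, as it does in the longer version.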