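My database contains incorrect person identifiers (id).
I would like to find an automated way to detect and correct them, but I can't figure it out.
I have only found a manual approach (very tedious).
The data looks like this: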
id time
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 2 1
8 2 2
9 2 3
10 3 1
11 3 2
12 3 3
13 3 1
14 3 2
15 3 3
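Judging from the information in the id variable, ids 2 and 3 are incorrect: each of them covers two separate time sequences. Every time time starts over, the id should change.
I created a row-count variable (row) and a flagged (corrected) id variable (id_f).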
dta$row = 1:nrow(dta)
dta$id_f = dta$id
Then I corrected the cases by hand:
dta[4:6, 'id_f'] <- paste( dta[4:6, 'id_f'], 'a', sep = '')
dta[7:9, 'id_f'] <- paste( dta[7:9, 'id_f'], 'b', sep = '')
dta[10:12, 'id_f'] <- paste( dta[10:12, 'id_f'], 'a', sep = '')
dta[13:15, 'id_f'] <- paste( dta[13:15, 'id_f'], 'b', sep = '')
Do you have any idea how to avoid doing this by hand?
The output I need, with the corrected id in id_f, is the following:
id time row id_f
1 1 1 1 1
2 1 2 2 1
3 1 3 3 1
4 2 1 4 2a
5 2 2 5 2a
6 2 3 6 2a
7 2 1 7 2b
8 2 2 8 2b
9 2 3 9 2b
10 3 1 10 3a
11 3 2 11 3a
12 3 3 12 3a
13 3 1 13 3b
14 3 2 14 3b
15 3 3 15 3b
Data
dta = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L), time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("id", "time"), class = "data.frame", row.names = c(NA,
-15L))
Answer 0 (score: 3)
Here is one possibility:
do.call(rbind,
        by(dta, dta$id, function(x){
          # identify cases with more than one "Each time that the time begin"
          # I used "more than one 1" as criteria
          if(sum(x$time == 1) > 1){
            # diff: to detect non-consecutive times, i.e. differences not equal to one.
            # cumsum: to create an index variable, used to index letters
            x$id2 <- paste0(x$id, letters[cumsum(c(FALSE, diff(x$time) != 1)) + 1])
            # for id with a correct sequence of "time", just use the original id
          } else {
            x$id2 <- x$id
          }
          x
        }))
# id time id2
# 1.1 1 1 1
# 1.2 1 2 1
# 1.3 1 3 1
# 2.4 2 1 2a
# 2.5 2 2 2a
# 2.6 2 3 2a
# 2.7 2 1 2b
# 2.8 2 2 2b
# 2.9 2 3 2b
# 3.10 3 1 3a
# 3.11 3 2 3a
# 3.12 3 3 3a
# 3.13 3 1 3b
# 3.14 3 2 3b
# 3.15 3 3 3b
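To make the diff/cumsum step above easier to follow, here is a minimal sketch that runs it on the time values of a single id in isolation. The vector and the intermediate names (breaks, run, suffix) are made up for illustration and are not part of the answer's code.

```r
# time values for one id that actually contains two individuals
time <- c(1L, 2L, 3L, 1L, 2L, 3L)

# TRUE wherever the step from the previous row is not +1,
# i.e. where a new time sequence starts (the first row is never a break)
breaks <- c(FALSE, diff(time) != 1)

# running count of breaks is a 0-based run index; + 1 makes it usable with letters
run <- cumsum(breaks) + 1

suffix <- letters[run]
print(suffix)
# [1] "a" "a" "a" "b" "b" "b"
```

Note that letters only has 26 entries, so an id with more than 26 sequences would get NA as a suffix.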
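Answer 1 (score: 1)
Not exactly what you asked for, but if you can tolerate a 1a without a 1b, it will work. It does require the data to be properly sorted before running.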
library(dplyr)
dta %>%
  mutate(time_diff = c(-1, diff(time)),
         new_time = (time_diff < 0),
         time_id = cumsum(new_time),
         row = 1:n()) %>%
  group_by(id) %>%
  mutate(time_id = time_id - (min(time_id) - 1),
         time_id = letters[time_id],
         id_f = paste0(id, time_id)) %>%
  ungroup() %>%
  select(id, time, row, id_f)
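If the 1a is a problem, one possible variation is to count the runs per id and only add the letter when an id has more than one. Below is a rough base R sketch of that idea. It is not the code above, the helper names (run, run_in_id, n_runs) are invented here, and it assumes, like the code above, that the rows are already sorted and that a new sequence always shows up as a drop in time.

```r
dta <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
  time = rep(1:3, times = 5)
)

# a new run starts on the first row and wherever time drops
run <- cumsum(c(TRUE, diff(dta$time) < 0))

# renumber the runs from 1 within each id, and count the runs per id
run_in_id <- run - ave(run, dta$id, FUN = min) + 1
n_runs    <- ave(run, dta$id, FUN = function(r) length(unique(r)))

dta$row  <- seq_len(nrow(dta))
# only ids with more than one run get a letter suffix
dta$id_f <- ifelse(n_runs > 1,
                   paste0(dta$id, letters[run_in_id]),
                   as.character(dta$id))
dta
```

On the example data this should leave id 1 as it is and give 2a, 2b, 3a, 3b for the others.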
Answer 2 (score: 1)
I named the data frame z.
z$timediff <- c(0,diff(z$time)) < 0
z$iddiff <- c(0,diff(z$id))
z$timediffminusiddiff <- z$timediff - z$iddiff
z$cumsumtimediff <- cumsum(z$timediff)
z$haserr <- ave(z$timediffminusiddiff,z$id,FUN = max)
z$newnum <- letters[z$cumsumtimediff - ave(z$cumsumtimediff,z$id,FUN = min) + 1]
z[z$haserr == 1,'id'] <- paste0(z$id,z$newnum)[z$haserr == 1]
z[ ,c('id','time')]
You could squeeze this into fewer lines, but then it would be harder to read.
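Whichever approach you use, a quick sanity check is to test whether any id still has the same time value more than once. The sketch below is only an illustration: has_reused_id is a hypothetical helper name, and it assumes each real individual has every time value at most once.

```r
# hypothetical helper: TRUE if some (id, time) pair occurs more than once
has_reused_id <- function(id, time) {
  any(duplicated(data.frame(id = id, time = time)))
}

id   <- c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L)
time <- rep(1:3, times = 3)
id_f <- c("1", "1", "1", "2a", "2a", "2a", "2b", "2b", "2b")

has_reused_id(id, time)    # TRUE: the raw id 2 covers two sequences
has_reused_id(id_f, time)  # FALSE: the corrected ids are unique per time
```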