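I have a data table in which many individuals (id) have been asked a question (class) n times. Sometimes they answered 0 or 99 (these are non-response codes: "refused to answer" and "unknown"), but they did answer the question when asked again later.
How can I replace the 0 or 99 within an id with that id's valid answer?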
Dummy data:
library(data.table)
df <- data.table(
id=rep(1:10,each=4),
class=c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))
What I want to get:
res <- data.table(
id=rep(1:10,each=4),
class=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,3,
2,2,2,2,99,99,99,99,1,1,1,1,0,0,0,0))
Visualised example:
> cbind(df, res = res[, !"id"])
id class res.class
1: 1 1 1
2: 1 1 1
3: 1 1 1
4: 1 1 1
5: 2 1 1
6: 2 1 1
7: 2 1 1
8: 2 99 1
9: 3 0 1
10: 3 0 1
11: 3 0 1
12: 3 1 1
13: 4 0 2
14: 4 2 2
15: 4 2 2
16: 4 2 2
17: 5 99 1
18: 5 99 1
19: 5 99 1
20: 5 1 1
21: 6 3 3
22: 6 3 3
23: 6 3 3
24: 6 0 3
25: 7 2 2
26: 7 2 2
27: 7 0 2
28: 7 99 2
29: 8 99 99
30: 8 99 99
31: 8 99 99
32: 8 99 99
33: 9 1 1
34: 9 1 1
35: 9 1 1
36: 9 1 1
37: 10 0 0
38: 10 0 0
39: 10 0 0
40: 10 0 0
id class res.class
In practice I have about 100,000 individuals, which is why I tagged this data.table, although I am open to other (faster) suggestions.
Answer 0 (score: 2)
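With data.table, this can also be solved by an update join: df is joined with a lookup table that holds one valid class per id (the first one found), and all class values in df are replaced by the corresponding value from the lookup table.
The lookup table is created by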
unique(df[!class %in% c(0,99)], by="id")
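id class
1: 1 1
2: 2 1
3: 3 1
4: 4 2
5: 5 1
6: 6 3
7: 7 2
8: 9 1
The lookup table contains entries only for ids with at least one valid answer. In the subsequent update join, the other ids, which have no valid answer, remain unchanged.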
df[unique(df[!class %in% c(0,99)], by="id"), on = "id", class := i.class][]
id class
1: 1 1
2: 1 1
3: 1 1
4: 1 1
5: 2 1
6: 2 1
7: 2 1
8: 2 1
9: 3 1
10: 3 1
11: 3 1
12: 3 1
13: 4 2
14: 4 2
15: 4 2
16: 4 2
17: 5 1
18: 5 1
19: 5 1
20: 5 1
21: 6 3
22: 6 3
23: 6 3
24: 6 3
25: 7 2
26: 7 2
27: 7 2
28: 7 2
29: 8 99
30: 8 99
31: 8 99
32: 8 99
33: 9 1
34: 9 1
35: 9 1
36: 9 1
37: 10 0
38: 10 0
39: 10 0
40: 10 0
id class
# check result
all.equal(df$class, res$class)
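The lookup table keeps only the first valid answer per id, so the update join silently assumes each id has a single valid answer. If that is not guaranteed in the real data, a check along these lines might be worth running first. This is only a sketch: the `toy` table and the `n_valid` column name are made up for illustration.

```r
library(data.table)

# Hypothetical toy data: id 2 gives two different valid answers (1 and 3)
toy <- data.table(
  id    = rep(1:3, each = 3),
  class = c(1, 0, 1,  1, 99, 3,  0, 0, 99)
)

# Count distinct valid answers per id; more than one signals a conflict
conflicts <- toy[!class %in% c(0, 99),
                 .(n_valid = uniqueN(class)),
                 by = id][n_valid > 1]
print(conflicts)
```

An empty `conflicts` table would suggest that taking the first valid answer per id is safe.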
Answer 1 (score: 1)
Here is a simple two-step solution with data.table.
df[, class2 := min(class[class != 0 & class != 99]), by = id] # minimum valid value per id, excluding 0 and 99; Inf (with a warning) when an id has no valid value
df[, class_final := ifelse(is.infinite(class2), class, class2)] # keep the original value where class2 is Inf, i.e. ids with only 0 or 99
all(res$class == df$class_final) # check against the expected result from the question
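The two steps above rely on `min()` returning `Inf` (with a warning) for ids that have no valid answer. A possible variant that avoids the warning and the helper column is sketched below. It assumes, as in the question's data, that `class` is stored as a double so both branches return the same type.

```r
library(data.table)

# Dummy data copied from the question
df <- data.table(
  id = rep(1:10, each = 4),
  class = c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
            1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

# Per id: use the minimum valid answer if there is one,
# otherwise keep the original 0/99 codes untouched
df[, class_final := {
  valid <- class[!class %in% c(0, 99)]
  if (length(valid) > 0) rep(min(valid), .N) else class
}, by = id]
```

The result can be compared with the expected table via `all.equal(df$class_final, res$class)`.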
Answer 2 (score: 0)
An Rcpp solution, applied to the data you posted. Afterwards, check that the answers are the same.
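library(data.table)
library(Rcpp)
df <- data.table(id=rep(1:10,each=4), class=c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))
cppFunction('std::vector<int> remap_class(std::vector<int> id, std::vector<int> df_class) {
  // id -> last valid (non-0, non-99) answer seen for that id
  std::map<int, int> class_remap;
  // indices start at 0 so that the first row is not skipped
  for (std::size_t i = 0; i < id.size(); i++) {
    if (df_class[i] != 0 && df_class[i] != 99) {
      class_remap[id[i]] = df_class[i];
    }
  }
  // overwrite every row of an id that has a valid answer
  for (std::size_t i = 0; i < id.size(); i++) {
    if (class_remap.count(id[i]) != 0) {
      df_class[i] = class_remap[id[i]];
    }
  }
  return df_class;
}')
df$class <- remap_class(df$id, df$class)
all.equal(df$class, res$class) # check against the expected result from the question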
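For cross-checking the compiled function without a C++ toolchain, the same map-based logic (the last valid answer per id wins) can be sketched in plain R. The function name `remap_class_r` and its `missing_codes` argument are hypothetical, chosen only for this illustration.

```r
# Plain-R sketch of the map-based remapping; not tuned for speed
remap_class_r <- function(id, cls, missing_codes = c(0, 99)) {
  valid <- !cls %in% missing_codes
  # Named vector: one entry per id that has a valid answer (the last one seen)
  last_valid <- vapply(split(cls[valid], id[valid]),
                       function(x) x[length(x)], numeric(1))
  # Position of each row's id in the lookup, NA if the id has no valid answer
  hit <- match(as.character(id), names(last_valid))
  ifelse(is.na(hit), cls, unname(last_valid[hit]))
}

# Hypothetical toy input: id 2 never gives a valid answer
toy_id  <- c(1, 1, 1, 2, 2, 3, 3)
toy_cls <- c(0, 2, 99, 99, 0, 3, 0)
remapped <- remap_class_r(toy_id, toy_cls)
print(remapped)
```

Rows of ids without any valid answer keep their original codes, which matches what the second loop in the C++ version does.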
Answer 3 (score: 0)
Here is a dplyr + tidyr solution:
library(dplyr) # for mutate, group_by and `%>%`
library(tidyr) # for fill
df %>%
  mutate(class2 = ifelse(class %in% c(0,99),NA,class)) %>% # new column with NAs for 0/99 so that fill can be used
  group_by(id) %>%
  fill(class2,.direction = "up") %>% # fill upwards, then downwards, within each id
  fill(class2,.direction = "down") %>%
  mutate(class2 = ifelse(is.na(class2),class,class2)) # replace remaining NAs (ids with no valid answer) by the initial value
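The two `fill()` calls can probably be collapsed into one, since recent versions of tidyr (1.0.0 or later, as far as I know) accept `.direction = "updown"`. The sketch below uses a made-up `toy` data frame rather than the question's data, and swaps `ifelse()` for `replace()` and `coalesce()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: id 2 has no valid answer at all
toy <- data.frame(
  id    = rep(1:3, each = 3),
  class = c(0, 2, 99,  99, 99, 99,  1, 1, 0)
)

filled <- toy %>%
  mutate(class2 = replace(class, class %in% c(0, 99), NA)) %>% # mark non-response codes as NA
  group_by(id) %>%
  fill(class2, .direction = "updown") %>%                      # fill up first, then down, within each id
  ungroup() %>%
  mutate(class2 = coalesce(class2, class))                     # ids with no valid answer keep their codes

print(filled)
```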