Replacing 0 and 99 within groups

Time: 2018-02-05 17:18:09

Tags: r data.table

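I have a data.table in which a number of individuals (id) have been asked the same question (class) n times. Sometimes their answer is 0 or 99 (these are the non-response codes for "refuse to answer" and "unknown"), even though they do answer the question when asked on another occasion.

How can I replace the 0 and 99 values within each id?

Dummy data: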

library(data.table)
df <- data.table(
  id=rep(1:10,each=4), 
  class=c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
    1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

What I would like to get:

res <- data.table(
  id=rep(1:10,each=4), 
  class=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,3,
    2,2,2,2,99,99,99,99,1,1,1,1,0,0,0,0))

To visualise the example:

> cbind(df, res = res[, !"id"])

    id class res.class
 1:  1     1         1
 2:  1     1         1
 3:  1     1         1
 4:  1     1         1
 5:  2     1         1
 6:  2     1         1
 7:  2     1         1
 8:  2    99         1
 9:  3     0         1
10:  3     0         1
11:  3     0         1
12:  3     1         1
13:  4     0         2
14:  4     2         2
15:  4     2         2
16:  4     2         2
17:  5    99         1
18:  5    99         1
19:  5    99         1
20:  5     1         1
21:  6     3         3
22:  6     3         3
23:  6     3         3
24:  6     0         3
25:  7     2         2
26:  7     2         2
27:  7     0         2
28:  7    99         2
29:  8    99        99
30:  8    99        99
31:  8    99        99
32:  8    99        99
33:  9     1         1
34:  9     1         1
35:  9     1         1
36:  9     1         1
37: 10     0         0
38: 10     0         0
39: 10     0         0
40: 10     0         0
    id class res.class

In practice I have about 100,000 individuals, which is why I tagged this data.table, although I am open to other (faster) suggestions.

4 answers:

Answer 0 (score: 2)

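With data.table, this can also be solved by an update join against a lookup table, which replaces all class values in df with the corresponding lookup value for each id.

The lookup table is created by

unique(df[!class %in% c(0,99)], by="id")

   id class
1:  1     1
2:  2     1
3:  3     1
4:  4     2
5:  5     1
6:  6     3
7:  7     2
8:  9     1

The lookup table only contains entries for ids that have at least one valid answer. In the subsequent update join, the other ids, which have no valid answer, remain unchanged.

df[unique(df[!class %in% c(0,99)], by="id"), on = "id", class := i.class][]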
    id class
 1:  1     1
 2:  1     1
 3:  1     1
 4:  1     1
 5:  2     1
 6:  2     1
 7:  2     1
 8:  2     1
 9:  3     1
10:  3     1
11:  3     1
12:  3     1
13:  4     2
14:  4     2
15:  4     2
16:  4     2
17:  5     1
18:  5     1
19:  5     1
20:  5     1
21:  6     3
22:  6     3
23:  6     3
24:  6     3
25:  7     2
26:  7     2
27:  7     2
28:  7     2
29:  8    99
30:  8    99
31:  8    99
32:  8    99
33:  9     1
34:  9     1
35:  9     1
36:  9     1
37: 10     0
38: 10     0
39: 10     0
40: 10     0
    id class
# check result
all.equal(df$class, res$class)
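
Because unique(..., by="id") keeps the first valid row per id, the update join implicitly assumes that each id has only one distinct valid answer, which holds for the dummy data. A minimal sketch of how that assumption could be checked before joining; the names valid and conflicts are illustrative and not part of the answer above:

```r
library(data.table)

# dummy data from the question
df <- data.table(
  id = rep(1:10, each = 4),
  class = c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
    1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

# keep only valid answers (neither 0 nor 99)
valid <- df[!class %in% c(0, 99)]

# count distinct valid answers per id; more than one would make the
# choice of replacement value ambiguous
conflicts <- valid[, .(n_valid = uniqueN(class)), by = id][n_valid > 1]

nrow(conflicts)  # 0 for the dummy data
```

If conflicts were not empty, a rule for picking the replacement (first, last, most frequent) would have to be decided before building the lookup table.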

Answer 1 (score: 1)

Here is a simple two-step data.table solution:

df[, class2 := min(class[class != 0 & class != 99]), by = id] # take the minimum value per group, excluding 0 and 99; min() of an empty vector gives Inf (with a warning)
df[, class_final := ifelse(is.infinite(class2), class, class2)] # keep the original value when class2 is Inf, i.e. for groups containing only 0 or 99

all(res$class == df$class_final) # check against the expected result
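
The two steps above rely on min() returning Inf, with a warning, for groups that contain only 0 or 99. As a hedged alternative sketch, the same result can be obtained in one grouped assignment that picks the first valid answer instead of the minimum. This assumes each id has at most one distinct valid answer, and the helper name valid is illustrative:

```r
library(data.table)

# dummy data from the question
df <- data.table(
  id = rep(1:10, each = 4),
  class = c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
    1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

df[, class_final := {
  valid <- class[!class %in% c(0, 99)]         # valid answers in this id
  if (length(valid) > 0) valid[1] else class   # no valid answer: keep as is
}, by = id]
```

This avoids the intermediate class2 column and the warnings, at the cost of evaluating an R expression once per group.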

Answer 2 (score: 0)

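An Rcpp solution, using the data you posted:

library(data.table)
library(Rcpp)

df <- data.table(id=rep(1:10,each=4), class=c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

cppFunction('std::vector<int> remap_class(std::vector<int> id, std::vector<int> df_class) {
  // first pass: record a valid answer (neither 0 nor 99) for each id;
  // if an id has several, the last one seen wins
  std::map<int, int> class_remap;
  for (std::size_t i = 0; i < id.size(); i++) {   // C++ vectors are 0-indexed
    if (df_class[i] != 0 && df_class[i] != 99) {
      class_remap[id[i]] = df_class[i];
    }
  }
  // second pass: overwrite every row of an id that has a valid answer
  for (std::size_t i = 0; i < id.size(); i++) {
    if (class_remap.count(id[i]) != 0) {
      df_class[i] = class_remap[id[i]];
    }
  }
  return df_class;
}')

df$class <- remap_class(df$id, df$class)

The result can then be checked against the expected answer you posted to confirm that it is the same.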

Answer 3 (score: 0)

Here is a dplyr + tidyr solution:

library(dplyr) # for mutate, group_by and `%>%`
library(tidyr) # for fill
df %>%
  mutate(class2 = ifelse(class %in% c(0,99),NA,class)) %>% # define a new column with NAs so that fill can be used
  group_by(id) %>%
  fill(class2,.direction = "up")   %>% # fill upwards first
  fill(class2,.direction = "down") %>% # then downwards
  mutate(class2 = ifelse(is.na(class2),class,class2)) # replace the remaining NAs with the initial value
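
As a possible simplification, reasonably recent tidyr versions accept .direction = "updown", which fills in both directions with a single call. The sketch below assumes such a version is installed; it uses a plain data.frame and stores the result in out, a name chosen here for illustration:

```r
library(dplyr)
library(tidyr)

# dummy data from the question, as a plain data.frame
df <- data.frame(
  id = rep(1:10, each = 4),
  class = c(1,1,1,1,1,1,1,99,0,0,0,1,0,2,2,2,99,99,99,
    1,3,3,3,0,2,2,0,99,99,99,99,99,1,1,1,1,0,0,0,0))

out <- df %>%
  mutate(class2 = ifelse(class %in% c(0, 99), NA, class)) %>% # mark non-responses
  group_by(id) %>%
  fill(class2, .direction = "updown") %>%  # fill up, then down, within each id
  ungroup() %>%
  mutate(class2 = ifelse(is.na(class2), class, class2)) # ids with no valid answer
```

Calling ungroup() before the last step is optional here, but it avoids carrying the grouping into later operations.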