I have a data.table in R, say dt, that looks like this:
> dt <- data.table(adr = c("A", "A", "A","A","A","A","A","B", "B", "C", "C", "C", "D", "E", "E"),
code=c("0001","0001","0001","0001","0001","0001","0001","0001","0001", "0002", "0002", "0002", "0003", "0003", "0003"),
num = c(1,67,875,467,986,34,987,876,785, 67,9078,45,907,451,987))
> dt
adr code num
1: A 0001 1
2: A 0001 67
3: A 0001 875
4: A 0001 467
5: A 0001 986
6: A 0001 34
7: A 0001 987
8: B 0001 876
9: B 0001 785
10: C 0002 67
11: C 0002 9078
12: C 0002 45
13: D 0003 907
14: E 0003 451
15: E 0003 987
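Each value of code should map to a single value of adr. For example, for code = 0001 we have two adr values, A and B. This is wrong. The correct adr (along with its associated records) is the one that accounts for the majority of the rows for that particular code (more than 50%).
So for code 0001, adr A appears 7 times and adr B appears 2 times, which means adr B and its associated records are wrong. I want to find these cases and delete the wrong records for each code.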
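To make the majority rule concrete, here is a small sketch that tabulates how many rows each adr has within each code, along with its share of that code. The names `counts` and `share` are only illustrative and do not come from the original data.

```r
library(data.table)

# Same data as above, written compactly with rep()
dt <- data.table(
  adr  = c(rep("A", 7), "B", "B", "C", "C", "C", "D", "E", "E"),
  code = c(rep("0001", 9), rep("0002", 3), rep("0003", 3)),
  num  = c(1, 67, 875, 467, 986, 34, 987, 876, 785, 67, 9078, 45, 907, 451, 987)
)

# Number of rows per (code, adr) pair
counts <- dt[, .N, by = .(code, adr)]

# Share of each adr within its code; an adr is "correct" when share > 0.5
counts[, share := N / sum(N), by = code]
counts
```

For code 0001 this gives 7 rows for A and 2 for B, so A holds the majority and the B rows are the ones to drop.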
The output must be as follows:
> dt
adr code num
1: A 0001 1
2: A 0001 67
3: A 0001 875
4: A 0001 467
5: A 0001 986
6: A 0001 34
7: A 0001 987
8: C 0002 67
9: C 0002 9078
10: C 0002 45
11: E 0003 451
12: E 0003 987
How can I do this in R using data.table?

Answer 0 (score: 0)
I've made dt a data.frame() rather than a data.table(), so I don't need to load an additional package, but you could do it as follows:
library(dplyr)
# Count rows per (code, adr), then keep only the most frequent adr within each code
dt <- dt %>% group_by(code, adr) %>% mutate(count = n()) %>%
  group_by(code) %>% filter(count == max(count)) %>% ungroup() %>% select(-count)
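Since the question asks for data.table specifically, the same idea could be sketched there as well. This is only one possible approach, not the only way to do it. The helper columns `n_adr` and `n_code` and the name `result` are made up for illustration. Unlike the `max(count)` filter above, which keeps every tied adr and does not require a majority, this version applies the stricter "more than 50%" rule from the question, so a code with no majority adr would lose all of its rows.

```r
library(data.table)

# Same data as in the question
dt <- data.table(
  adr  = c(rep("A", 7), "B", "B", "C", "C", "C", "D", "E", "E"),
  code = c(rep("0001", 9), rep("0002", 3), rep("0003", 3)),
  num  = c(1, 67, 875, 467, 986, 34, 987, 876, 785, 67, 9078, 45, 907, 451, 987)
)

result <- copy(dt)                         # leave the original dt untouched
result[, n_adr := .N, by = .(code, adr)]   # rows for this (code, adr) pair
result[, n_code := .N, by = code]          # rows for this code overall
result <- result[n_adr > n_code / 2]       # keep strict-majority adr only
result[, c("n_adr", "n_code") := NULL]     # drop the helper columns
result
```

On the sample data this keeps the 7 rows for A, the 3 rows for C, and the 2 rows for E, which matches the desired output in the question.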