我需要从拥有1亿行的大数据框中删除重复项。我正在测试data.table是否可以帮助我。但是,在以下代码中,data.table中的unique()不会生成与data.frame的unique()相同的结果。 data.table中的setkey中是否存在可能的错误?
library(data.table)
tmp <- data.frame(id=c(1000000128152, 1000000228976, 1000000235508, 1000000294933, 1000000311288, 1000000353770, 1000000441585, 1000000466482, 1000000473521,
1000000491353, 1000000497787, 1000000534948, 1000000589071, 1000000622890, 1000000658287, 1000000695865, 1000000731674, 1000000780659,
1000000818218, 1000000834389, 1000000877189, 1000000937770, 1000000937770, 1000000996135, 1000001061831, 1000001062057, 1000001065241,
1000001097542, 1000001122242, 1000001177167, 1000001194078, 1000001216323, 1000001232155, 1000001294998, 1000001361126, 1000001361126,
1000001389830, 1000001411284, 1000001415793, 1000001417557, 1000001485326, 1000001565513, 1000001624601, 1000001650282, 1000001681805,
1000001683548, 1000001683548, 1000001693445, 1000001693455, 1000001693462, 1000001693466, 1000001693490, 1000001693490, 1000001703493,
1000001703511, 1000001703518, 1000001703546, 1000001703554, 1000001703613, 1000001703644))
unique(tmp$id)
DT <- data.table(tmp)
setkey(DT, id)
DTU <- unique(DT)
DTU$id
Results from the unique(tmp$id):
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693466
[49] 1000001693490 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
Result from DTU$id:
[1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693490
[49] 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644
比较两者,我们看到1000001693466错误地丢弃了DTU。有什么建议吗?我怀疑它是setkey,因为当我从所有数字中减去1000000000000时,结果是一样的。
答案 0 :(得分:8)
编辑(来自Arun):默认的舍入功能已在current development version of data.table, v1.9.7中删除,并可能保持这种方式前进。有关安装说明,请参阅here。
这也意味着您完全有责任理解浮点数的表示限制并处理它们: - )。
help(setkey)
说(data.table version 1.9.6
):
请注意,默认情况下,数字类型的列(即double)在计算顺序时将其最后两个字节四舍五入,以避免由于精确表示浮点数的限制而导致的任何意外行为。看看setNumericRounding了解更多信息。
在键入之前将舍入更改为1个字节
DT <- data.table(tmp)
setNumericRounding(1) # set rounding
setkey(DT, id)
该值不再被删除。
但是,help(setNumericRounding)
说
对于大数字(整数> 2 ^ 31),我们建议使用bit64 :: integer64而不是将舍入设置为0。