Question

I have a historical file for a project. Call this data frame A and suppose it is 5,000 rows by 20 columns. I need to keep appending new records to A.

To test that duplicates are removed correctly, I run:

A <- rbind(A, A) #These are the exact same file. This is now 10,000 x 20

A <- dplyr::filter(A, !duplicated(A)) #using the dplyr package, this is now 5,000 x 20

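Note that the above works as expected: it removes all of the duplicates. However, when I test the same thing by saving the file, reading it back in, and calling rbind() again: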

readr::write_csv(A, "path/A_saved") #Saving the historical file A

A_import <- readr::read_csv("A_saved") #Loading in the historical file I just saved to my computer

A <- rbind(A, A_import) #Again, these are still the exact same file, same dimension with the same records, each with a duplicate row. This is now 10,000 by 20

A <- dplyr::filter(A, !duplicated(A)) #Same as above, BUT this is now 6,000 x 20

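This removes most of the duplicates, but not all of them. On inspection, the 1,000 rows that should have been removed still look like exact copies of other rows.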
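One way to check whether rows that look identical really are identical is to compare the two copies column by column. The sketch below uses a small toy data frame, since the real A is not shown here. The round trip through a file is only simulated by converting the numeric column to text with 15 significant digits and back; that precision is an assumption for illustration, not a statement about what write_csv() or read_csv() actually did in this case.

```r
# Toy stand-in for A (the real data is not shown in the post)
A <- data.frame(id = 1:3, value = c(0.1 + 0.2, 1 / 3, 2))

# Stand-in for A_import: simulate a trip through text that keeps
# 15 significant digits (an assumed precision, for illustration only)
A_import <- A
A_import$value <- as.numeric(sprintf("%.15g", A$value))

# 1. Do the column classes still match after the round trip?
classes_match <- identical(lapply(A, class), lapply(A_import, class))

# 2. Which columns are bit-for-bit identical?
exact <- vapply(names(A),
                function(nm) identical(A[[nm]], A_import[[nm]]),
                logical(1))

# 3. Which columns are equal within all.equal()'s numeric tolerance?
near <- vapply(names(A),
               function(nm) isTRUE(all.equal(A[[nm]], A_import[[nm]])),
               logical(1))

print(classes_match)
print(exact)
print(near)
```

A column that is reported as equal within tolerance but not bit-for-bit identical would point to a small loss of numeric precision. A class mismatch would instead point to a column being parsed as a different type on import.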

What is happening when read_csv() and duplicated() are used together like this? I searched for similar questions and could not find a solution.

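I also tried unique() instead of duplicated(), and the problem persists. When I rbind() a data frame object with its read_csv() version and then filter() out the duplicates, not all of the duplicates are removed.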

duplicated() does not identify all of the duplicate rows as duplicates.
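If the mismatch does turn out to come from tiny floating-point differences (an unconfirmed guess in this case), one possible workaround is to deduplicate on a text key in which doubles are rendered with a fixed number of significant digits. The following is a minimal sketch under that assumption. `row_key` is a hypothetical helper written for this example, not a function from dplyr or readr, and the choice of 10 digits is arbitrary.

```r
# Toy stand-in for rbind(A, A_import): the second row is the first row
# after a simulated trip through text with 15 significant digits
original   <- data.frame(id = 1L, value = 0.1 + 0.2)
reimported <- data.frame(id = 1L,
                         value = as.numeric(sprintf("%.15g", original$value)))
combined   <- rbind(original, reimported)

# Exact comparison treats the two values as different doubles
values_equal <- combined$value[1] == combined$value[2]

# Hypothetical helper: build one text key per row, rendering plain double
# columns with a fixed number of significant digits (other column types
# are simply passed through as.character())
row_key <- function(df, digits = 10) {
  as_text <- lapply(df, function(col) {
    if (is.double(col) && is.numeric(col)) {
      sprintf(paste0("%.", digits, "g"), col)
    } else {
      as.character(col)
    }
  })
  do.call(paste, c(unname(as_text), sep = "\r"))
}

# Keep only the first row for each key
deduped <- combined[!duplicated(row_key(combined)), , drop = FALSE]

print(values_equal)
print(nrow(deduped))
```

The trade-off is that rows whose numeric values differ only beyond the chosen number of digits are treated as the same record, so `digits` should be picked with the precision of the real data in mind.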

Any ideas?

Thanks in advance.

Duplicate rows are no longer recognized as duplicates after using read_csv()

0 answers: