Question

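I'm trying to clean a dataset of student results for processing, and I'm having trouble removing duplicates completely. I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data for one of the duplicated students: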

        id     period                                           desc
632   1507       1101 90714 Research a contemporary biological issue
633   1507       1101         6317 Explain the process of speciation
634   1507       1101                  8931 Describe gene expression
14448 1507       1201                  8931 Describe gene expression
14449 1507       1201         6317 Explain the process of speciation
14450 1507       1201 90714 Research a contemporary biological issue
25884 1507       1301         6317 Explain the process of speciation
25885 1507       1301                  8931 Describe gene expression
25886 1507       1301 90714 Research a contemporary biological issue

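The first two digits of reg_period (the period column above) are the year in which the student sat the paper. As the example shows, I want to keep only the rows where id is 1507 and reg_period is 1101. So far, my code for finding the values I want to trim out is: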

unique.rows <- unique(df[c("id", "period")])
dups <- (unique.rows[duplicated(unique.rows$id),])

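However, I've run into a few problems. This only works because the data happen to be sorted by id and reg_period, which is not guaranteed in future. Also, I don't know how to take this list of duplicate entries and then select the rows that are not in it: %in% doesn't seem to work with it, and a loop-based approach runs out of memory.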

What is the best way to handle this?

Answer 1

I would probably use dplyr. Calling your data df:

library(dplyr)
# Keep every row from each id's earliest period.
result = df %>% group_by(id) %>%
    filter(period == min(period))
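
As a quick check that this does not depend on row order, here is a minimal sketch on a shuffled copy of the question's sample rows. The df built here and the shortened desc strings are illustrative stand-ins, not the asker's real data.

```r
library(dplyr)

# Illustrative copy of the sample rows (desc shortened); not the real dataset.
df <- data.frame(
  id     = rep(1507, 9),
  period = rep(c(1101, 1201, 1301), each = 3),
  desc   = rep(c("90714", "6317", "8931"), times = 3),
  stringsAsFactors = FALSE
)

# Shuffle the rows so the result cannot rely on the input being sorted.
set.seed(1)
df <- df[sample(nrow(df)), ]

# Keep every row from each id's earliest period.
result <- df %>%
  group_by(id) %>%
  filter(period == min(period)) %>%
  ungroup()

print(result)
```

Because the filter compares each row against its own group's minimum, all three rows from period 1101 should survive regardless of how the input is ordered.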

If you prefer base R, I would pull the id/period combinations to keep into a separate data frame, then do an inner join with the original data:

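# One row per id, holding its earliest period (sort first so that row comes first).
id_pd = df[order(df$id, df$period), c("id", "period")]
id_pd = id_pd[!duplicated(id_pd$id), ]
# Inner join on the shared columns (id, period) keeps only first-attempt rows.
result = merge(df, id_pd)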
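
An alternative base R sketch, assuming period is numeric, avoids both the sort and the merge by using ave() to compute each id's earliest period in place. The helper name keep_first_period and the second student (id 1600) are made up for illustration.

```r
# Hypothetical helper: keep all rows from each id's earliest period.
# Assumes the columns are named `id` and `period` and that period is numeric.
keep_first_period <- function(df) {
  # Per-id minimum period, returned as a vector aligned with the rows of df.
  first_period <- ave(df$period, df$id, FUN = min)
  df[df$period == first_period, , drop = FALSE]
}

# Illustrative data in deliberately unsorted order.
df <- data.frame(
  id     = c(1507, 1600, 1507, 1507, 1600, 1507),
  period = c(1201, 1301, 1101, 1101, 1201, 1301),
  desc   = c("8931", "6317", "6317", "8931", "6317", "8931"),
  stringsAsFactors = FALSE
)

first_attempts <- keep_first_period(df)
print(first_attempts)
```

Since the comparison is done row by row against the group minimum, the original row order is preserved and no join is needed.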

Answer 2

Try this; it works on my data. It sorts by id and period, then keeps the first row of each id/period combination:

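# Read the data: id and period as numeric, desc as character.
dd <- read.csv("a.csv", colClasses = c("numeric", "numeric", "character"), header = TRUE)
print(dd)
# Sort by id, then by period.
dd <- dd[order(dd$id, dd$period), ]
# Keep the first row of each id/period combination.
dd <- dd[!duplicated(dd[, c("id", "period")]), ]
print(dd)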

Output:

    id period                                           desc
1 1507   1101 90714 Research a contemporary biological issue
4 1507   1201                  8931 Describe gene expression
7 1507   1301         6317 Explain the process of speciation

Removing duplicates from a data frame in R
