Question

I have a data set of the following form:

 V1   V2    V3          V4
999   53 2015-07-02     2
999   53 2011-07-03     3
998   56 2015-03-08     4
998   56 2011-03-18     5
998   58 2014-12-26     6
998   57 2016-05-21     8
998   57 2015-04-12     9
998   58 2013-09-29     10
997   63 2013-09-28     19
997   63 2014-08-21     20

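Note that the duplicates always show up in columns V1 and V2: (999, 53), (998, 56) and so on. Also note that V3 is a date, so the two entries that make up a duplicate occur at two different times.

From this data set I want to create two data frames, one containing the older entry of each duplicate pair and the other containing the more recent entry. That is, I want to end up with the following two data frames: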

"old" (the earlier-dated entry of each pair)

999   53 2011-07-03     3
998   56 2011-03-18     5
998   57 2015-04-12     9
998   58 2013-09-29     10
997   63 2013-09-28     19

and "early" (the later-dated entry of each pair)

999   53 2015-07-02     2
998   56 2015-03-08     4
998   58 2014-12-26     6
998   57 2016-05-21     8
997   63 2014-08-21     20

I could of course use two for loops for this, but my data is very large, so that would be inefficient. Is there another way to achieve this?

Answer 1

As Jealie pointed out in the comments, df must first be sorted on V3 for these solutions to work.

df = df[order(df$V3),]

You can then split it in a single step

split(df, duplicated(df[,1:2]))
or use duplicated on columns V1 and V2 to build each subset separately

df[!duplicated(df[,1:2]),]
df[duplicated(df[,1:2]),]
or use ave to determine whether each row is the first or the second occurrence of its pair, and subset directly.

df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) ==1,]
df[ave(seq_along(df$V1), paste(df$V1, df$V2, sep = "-"), FUN = seq_along) == 2,]

Data

df = structure(list(V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L), V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L), V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26", "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"), V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L)), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -10L))
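
The steps above can be combined into one short sketch, run on the sample data from the question. The names `df_sorted`, `is_dup`, `old` and `recent` are my own, and converting V3 with `as.Date` is an optional extra step rather than part of the original answer; the ISO-formatted strings already sort correctly as text.

```r
# Sample data from the question (V3 stored as character)
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26",
         "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Optional (my assumption): store V3 as Date so ordering does not
# depend on the string format
df$V3 <- as.Date(df$V3)

# Sort by date so the first occurrence of each (V1, V2) pair is the older one
df_sorted <- df[order(df$V3), ]

# TRUE for every row whose (V1, V2) pair has already been seen
is_dup <- duplicated(df_sorted[, c("V1", "V2")])

old    <- df_sorted[!is_dup, ]  # earlier-dated row of each pair
recent <- df_sorted[is_dup, ]   # later-dated row of each pair
```

Both results come back in date order rather than in the original row order; the original row names are preserved if you need to map back.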

Answer 2

This works as long as you only have pairs.

# get the positions of the rows sorted by V2 and then V3
myOrd <- with(df, order(V2, V3))

# Keep the first observation of each pair (early)
df[myOrd[c(TRUE, FALSE)],]
   V1 V2         V3 V4
2 999 53 2011-07-03  3
4 998 56 2011-03-18  5
7 998 57 2015-04-12  9
8 998 58 2013-09-29 10
9 997 63 2013-09-28 19

# Keep the second observation of each pair (late)
df[myOrd[c(FALSE, TRUE)],]
    V1 V2         V3 V4
1  999 53 2015-07-02  2
3  998 56 2015-03-08  4
6  998 57 2016-05-21  8
5  998 58 2014-12-26  6
10 997 63 2014-08-21 20

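Here, order is used to find the positions of the observations in sorted order. The logical vectors c(TRUE, FALSE) and c(FALSE, TRUE) are then recycled along those positions to pick out every first or every second row. Note that the sort uses V2 only, which is enough here because each V2 value occurs under a single V1; if that does not hold in your data, add V1 to the order() call.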
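
Because the alternating-index trick silently gives wrong results when a group is not exactly a pair, it may be worth checking that assumption first. The following is a sketch of one way to do so, not part of the original answer; it also sorts on the full (V1, V2) key, and the names `pair_id`, `ord`, `first_of_pair` and `second_of_pair` are my own.

```r
# Sample data from the question
df <- data.frame(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26",
         "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L),
  stringsAsFactors = FALSE
)

# Stop early unless every (V1, V2) combination occurs exactly twice
pair_id <- paste(df$V1, df$V2, sep = "-")
stopifnot(all(table(pair_id) == 2))

# Sort by the full key, then by date within each pair
ord <- with(df, order(V1, V2, V3))

# The logical vectors are recycled along ord: odd positions, then even ones
first_of_pair  <- df[ord[c(TRUE, FALSE)], ]  # earlier date in each pair
second_of_pair <- df[ord[c(FALSE, TRUE)], ]  # later date in each pair
```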

Answer 3

With data.table this can be done very efficiently:

require('data.table') # if needed, install before with install.packages('data.table') 

dt = data.table(your_data_frame)

dt[, type := ifelse(V3==min(V3),'old','new'), keyby=c('V1','V2')]

This creates a new column holding the status of each entry:

> dt
     V1 V2         V3 V4 type
 1: 997 63 2013-09-28 19  old
 2: 997 63 2014-08-21 20  new
 3: 998 56 2015-03-08  4  new
 4: 998 56 2011-03-18  5  old
 5: 998 57 2016-05-21  8  new
 6: 998 57 2015-04-12  9  old
 7: 998 58 2014-12-26  6  new
 8: 998 58 2013-09-29 10  old
 9: 999 53 2015-07-02  2  new
10: 999 53 2011-07-03  3  old

You can then subset the data with dt[type == 'new'] or dt[type == 'old'].
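
As a variation on the same idea, the labelled table can be split into both pieces at once. This is a sketch that assumes the data.table package is installed; it uses `by` rather than `keyby`, so the rows keep their original order, and the names `parts`, `old` and `newer` are my own.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  V1 = c(999L, 999L, 998L, 998L, 998L, 998L, 998L, 998L, 997L, 997L),
  V2 = c(53L, 53L, 56L, 56L, 58L, 57L, 57L, 58L, 63L, 63L),
  V3 = c("2015-07-02", "2011-07-03", "2015-03-08", "2011-03-18", "2014-12-26",
         "2016-05-21", "2015-04-12", "2013-09-29", "2013-09-28", "2014-08-21"),
  V4 = c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 19L, 20L)
)

# Within each (V1, V2) group, the row with the minimum date is "old"
dt[, type := ifelse(V3 == min(V3), "old", "new"), by = .(V1, V2)]

# split() on a data.table accepts `by` and returns a named list of tables
parts <- split(dt, by = "type")
old   <- parts[["old"]]
newer <- parts[["new"]]
```

Unlike the sort-and-alternate approach, this does not require every group to be a pair: groups with more than two rows simply get one "old" row (or several, if the minimum date is tied) and the rest labelled "new".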

Extract data frame entries based on date
