Select the row with the minimum date per group, keeping all columns

Date: 2015-10-29 13:15:16

Tags: r dplyr plyr

I've hit a wall here.

I have a data frame with many rows. Here is a sketch of it.

#myDf
ID    c1    c2    myDate
A     1     1     01.01.2015
A     2     2     02.02.2014
A     3     3     03.01.2014
B     4     4     09.09.2009
B     5     5     10.10.2010
C     6     6     06.06.2011
....

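I need to group the data frame by ID, select the row with the earliest date in each group, and write the output to a new data frame, keeping all columns.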

ID    c1    c2    myDate
A     3     3     03.01.2014
B     4     4     09.09.2009
C     6     6     06.06.2011
....

This is how I approached it:

test <- myDf %>%
    group_by(ID) %>%
    mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
    filter(date == min(b2))

To verify: the nrow of my result data frame should equal the number of unique IDs.

unique(myDf$ID) %>% length == nrow(test)
  

FALSE

That doesn't work. I also tried this:

newDf <- ddply(.data = myDf,
              .variables = "ID",
              .fun = function(piece){
                  take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
                  piece[take.this.row,]
                  })

This really takes forever. I terminated it.

Why doesn't the first approach work, and what would be a good way to solve the problem?

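3 answers:

Answer 0: (score: 2)

Given that you have a very large dataset, I think data.table is the better choice here. This is a data.table version of the solution, which is typically faster than dplyr on large data: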

library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
                 myDate=c("01.01.2015","02.02.2014",
                          "03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
# Convert the date strings to Date class by reference
df[,myDate:=as.Date(myDate, format = '%d.%m.%Y')]

# .I gives the row numbers; keep the rows whose date equals the group minimum
df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
df_new
#    ID c1 c2     myDate
# 1:  A  3  3 2014-01-03
# 2:  B  4  4 2009-09-09
# 3:  C  6  6 2011-06-06

PS: You can use setDT(mydf) to convert a data.frame to a data.table by reference.
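
One caveat with the `myDate == min(myDate)` filter: if several rows in a group share the earliest date, all of them are returned, so the result can have more rows than there are IDs. Below is a minimal sketch of a variant that keeps exactly one row per ID by using `which.min` inside `.I`. The data mirrors the question; the extra tied row in group B and the name `df_one` are made up here for illustration.

```r
library(data.table)

# Sample data from the question, plus one hypothetical extra row that ties
# with the earliest date in group B.
df <- data.table(
  ID     = c("A", "A", "A", "B", "B", "B", "C"),
  c1     = 1:7,
  c2     = 1:7,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "09.09.2009", "06.06.2011")
)
df[, myDate := as.Date(myDate, format = "%d.%m.%Y")]

# .I holds row numbers in the full table; which.min() returns the position
# of the first minimum, so each group contributes exactly one row.
df_one <- df[df[, .I[which.min(myDate)], by = ID]$V1]

df_one
```

Whether ties should be collapsed to a single row or all kept depends on the data; the `==` version keeps them all, this version keeps the first.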

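Answer 1: (score: 1)

After grouping by "ID", we can use which.min to get the index of the earliest "myDate" (after converting it to Date class), and then use slice to extract that row.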

library(dplyr)
df1 %>% 
   group_by(ID) %>% 
   slice(which.min(as.Date(myDate, '%d.%m.%Y')))
#     ID    c1    c2     myDate
#  (chr) (int) (int)      (chr)
#1     A     3     3 03.01.2014
#2     B     4     4 09.09.2009
#3     C     6     6 06.06.2011

Data

df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, 
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014", 
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID", 
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA, 
 -6L))
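
As for why the first attempt in the question fails, two things stand out: `mutate(date == ...)` compares instead of assigning (a single `=` is needed), and `filter(date == min(b2))` refers to a column `b2` that is not present in the sample data shown. Here is a sketch of the filter-based version with those two points changed. The data is repeated so the block runs on its own, and `earliest` is a name chosen here.

```r
library(dplyr)

# Same sample data as above.
df1 <- data.frame(
  ID     = c("A", "A", "A", "B", "B", "C"),
  c1     = 1:6,
  c2     = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)

earliest <- df1 %>%
  # Assign with `=` so a real Date column named `date` is created.
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%
  group_by(ID) %>%
  # Compare each date against the minimum of the same column within its group.
  filter(date == min(date)) %>%
  ungroup()

# Same check as in the question; it holds as long as no group has tied dates.
n_distinct(df1$ID) == nrow(earliest)
```

Unlike `slice(which.min(...))`, this keeps every row that ties for the earliest date within a group.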

Answer 2: (score: 0)

If you only want to use base functions, you can also use aggregate and merge.

# data (from response above)

df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, 
                  c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014", 
                                       "09.09.2009", "10.10.2010", "06.06.2011")),
             .Names = c("ID","c1", "c2", "myDate"),
             class = "data.frame", row.names = c(NA,-6L))

# convert your date column to POSIXct object

df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")

# Use the aggregate function to look for the minimum date in each group.
# In this case our variable of interest is the myDate column and the
# grouping variable is the "ID" column.
# The function picks out the minimum date per group and creates a new
# data frame with columns "ID" and "myDate"

df2 = aggregate(list(myDate = df1$myDate),list(ID = df1$ID),
            function(x){x[which(x == min(x))]})

df2

# Use the merge function to merge your original data frame with the
# data from the aggregate function

merge(df1,df2)
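
A base-R alternative that avoids `merge` is to sort by ID and date and keep the first row of each ID. This is a minimal sketch on the same sample data; `d`, `sorted` and `first_rows` are names made up here.

```r
# Same sample data as above, with myDate still stored as character.
df1 <- data.frame(
  ID     = c("A", "A", "A", "B", "B", "C"),
  c1     = 1:6,
  c2     = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)

# Parse the dates once; the original strings stay in the data frame.
d <- as.Date(df1$myDate, format = "%d.%m.%Y")

# Sort by ID, then by date, so the earliest row of each ID comes first.
sorted <- df1[order(df1$ID, d), ]

# duplicated() flags every repeat of an ID after its first occurrence,
# so negating it keeps exactly one (the earliest) row per ID.
first_rows <- sorted[!duplicated(sorted$ID), ]

first_rows
```

This returns a single row per ID even when dates are tied, and it keeps all original columns without a join.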