我在这里撞墙了。
我有一个dataframe
,很多行。
这是示意图。
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
我需要将dataframe
分组到我的ID
,然后选择具有最早日期的行,并将输出写入新数据框 - 保留所有行。
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
这就是我接近它的方式:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
要验证:我的结果数据框的nrow
应与unique
返回相同。
unique(myDf$ID) %>% length == nrow(test)
FALSE
不起作用。我试过这个:
newDf <- ddply(.data = myDf,
.variables = "ID",
.fun = function(piece){
take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
piece[take.this.row,]
})
这确实会永远存在。我终止了它。
为什么第一种方法不起作用,什么是解决问题的好方法?
答案 0 :(得分:2)
考虑到你有一个非常大的数据集,我认为使用data.table会更好!这是解决问题的data.table版本,它比dplyr包更快:
library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
myDate=c("01.01.2015","02.02.2014",
"03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]
> df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
> df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
PS:您可以使用setDT(mydf)将data.frame转换为data.table。
答案 1 :(得分:1)
按“ID”分组后,我们可以使用which.min
获取“myDate”的索引(转换为Date
类后),然后使用slice
提取行。
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which.min(as.Date(myDate, '%d.%m.%Y')))
# ID c1 c2 myDate
# (chr) (int) (int) (chr)
#1 A 3 3 03.01.2014
#2 B 4 4 09.09.2009
#3 C 6 6 06.06.2011
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID",
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA,
-6L))
答案 2 :(得分:0)
如果您只想使用基本功能,您还可以使用聚合和合并功能。
# data (from response above)
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")),
.Names = c("ID","c1", "c2", "myDate"),
class = "data.frame", row.names = c(NA,-6L))
# convert your date column to POSIXct object
df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")
# Use the aggregate function to look for the minimum dates by group.
# In this case our variable of interest in the myDate column and the
# group to sort by is the "ID" column.
# The function will sort out the minimum date and create a new data frame
# with names "myDate" and "ID"
df2 = aggregate(list(myDate = df1$myDate),list(ID = df1$ID),
function(x){x[which(x == min(x))]})
df2
# Use the merge function to merge your original data frame with the
# data from the aggregate function
merge(df1,df2)