I'm looking for help with data.table and/or dplyr. I have a data frame like this:
Name Date X Y
Mike 2016-10-21 3.2 1.6
Mike 2016-10-23 3.1 1.4
Mike 2016-10-24 4.9 3.8
Mike 2016-10-25 5.7 4.2
Mike 2016-10-28 0.2 -1.1
Bob 2016-10-21 2.2 -1.1
Bob 2016-10-22 0.2 -3.6
Bob 2016-10-24 -9.2 -14.1
Bob 2016-10-25 -7.2 -12.1
Alice 2016-10-20 7.2 6.1
Alice 2016-10-21 2.2 0.1
Alice 2016-10-23 13.2 8.1
Alice 2016-10-25 12.6 8.8
Alice 2016-10-27 7.7 4.7
Alice 2016-10-28 8.2 5.0
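I'd like to get the mean of X and Y. However, for each person I want to subset the data so that only the values from that person's 3 most recent dates are used, ignoring data from older dates. I'd also like to return the number of days spanned by those 3 most recent dates. Ideally, I'd end up with a data frame like this: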
Name DaysBetween avgX avgY
Mike 4 3.6 2.3
Bob 3 -5.4 -9.9
Alice 3 9.5 6.2
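Edit note: this data will always be sorted by date, so we could also just take the "last 3" data points for each person instead of using date logic to work out which three are the most recent.
Thanks in advance for your help!
Answer 0: (score: 2)
We can use data.table: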
library(data.table)
setDT(df1)[order(-Date), .(DaysBetween = as.integer(Date[1L] - Date[3L]),
avgX = mean(X[1:3]), avgY = round(mean(Y[1:3]),2)), by = Name]
# Name DaysBetween avgX avgY
#1: Mike 4 3.6 2.30
#2: Alice 3 9.5 6.17
#3: Bob 3 -5.4 -9.93
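Because the question's edit note says the rows are already sorted by date within each person, a variant that simply keeps the last three rows per group may also work. The following is a minimal sketch, not part of the original answer: it rebuilds the sample data with read.table (df1 follows the name used above, res is a hypothetical name) and assumes every person has at least three rows.

```r
library(data.table)

# Rebuild the sample data from the question
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name Date X Y
Mike 2016-10-21 3.2 1.6
Mike 2016-10-23 3.1 1.4
Mike 2016-10-24 4.9 3.8
Mike 2016-10-25 5.7 4.2
Mike 2016-10-28 0.2 -1.1
Bob 2016-10-21 2.2 -1.1
Bob 2016-10-22 0.2 -3.6
Bob 2016-10-24 -9.2 -14.1
Bob 2016-10-25 -7.2 -12.1
Alice 2016-10-20 7.2 6.1
Alice 2016-10-21 2.2 0.1
Alice 2016-10-23 13.2 8.1
Alice 2016-10-25 12.6 8.8
Alice 2016-10-27 7.7 4.7
Alice 2016-10-28 8.2 5.0
")
setDT(df1)
df1[, Date := as.Date(Date)]  # make sure Date is a Date, not character

# Keep the last 3 rows per person (rows assumed sorted by date), then summarise
res <- df1[, tail(.SD, 3L), by = Name][
  , .(DaysBetween = as.integer(max(Date) - min(Date)),
      avgX = mean(X),
      avgY = round(mean(Y), 2)),
  by = Name]
print(res)
```

With this approach the groups come back in order of first appearance (Mike, Bob, Alice) rather than in the order shown above. If the data could ever arrive unsorted, ordering by Date first, as in the original answer, is the safer choice.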
Answer 1: (score: 1)
The answers above are all good; here is an iterative approach:
#initialize the output frame
outputFrame = as.data.frame(matrix(nrow = length(unique(train$Name)),
                                   ncol = 4))
#renaming the data frame
names(outputFrame) = c("Names", "daysBetween", "avgX", "avgY")
#turn the date to a date (the sample data uses the YYYY-MM-DD format)
train$Date = as.Date(train$Date, "%Y-%m-%d")
#initialize the outputCounter
outputCounter = 1
#iterates over every unique Name in the data frame
for(name in as.character(unique(train$Name)))
{
  #subsets the dataframe into the values of each given level of Name
  dfSubset = train[which(train$Name == name),]
  #Orders the dataframe by date
  dfSubset = dfSubset[order(dfSubset$Date),]
  #get the 3 most recent dates
  dfSubset = dfSubset[(nrow(dfSubset) - 2):nrow(dfSubset),]
  #fill the names
  outputFrame$Names[outputCounter] = name
  #fill the days between
  outputFrame$daysBetween[outputCounter] = as.numeric(max(dfSubset$Date) - min(dfSubset$Date))
  #get the average X
  outputFrame$avgX[outputCounter] = mean(dfSubset$X)
  #get the average Y
  outputFrame$avgY[outputCounter] = mean(dfSubset$Y)
  #increment outputCounter
  outputCounter = outputCounter + 1
}
This assumes train is your data frame.
Answer 2: (score: 0)
You can use dplyr::top_n to filter the data:
library(dplyr)
df %>% mutate(Date = as.Date(Date)) %>% # parse to Date class, if not already
group_by(Name) %>%
top_n(3, Date) %>% # filter to max 3 dates for each group
summarise(DaysBetween = max(Date) - min(Date),
avgX = mean(X),
avgY = mean(Y))
## # A tibble: 3 × 4
## Name DaysBetween avgX avgY
## <fctr> <time> <dbl> <dbl>
## 1 Alice 3 days 9.5 6.166667
## 2 Bob 3 days -5.4 -9.933333
## 3 Mike 4 days 3.6 2.300000
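In dplyr 1.0.0 and later, top_n() is documented as superseded by slice_max() and slice_min(). Below is a hedged sketch of the same idea with slice_max(), assuming that dplyr version is installed. It is not part of the original answer; df is rebuilt here from the question's sample data and res is a hypothetical name. Setting with_ties = FALSE keeps exactly three rows per person even if a date were duplicated.

```r
library(dplyr)

# Rebuild the sample data from the question
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name Date X Y
Mike 2016-10-21 3.2 1.6
Mike 2016-10-23 3.1 1.4
Mike 2016-10-24 4.9 3.8
Mike 2016-10-25 5.7 4.2
Mike 2016-10-28 0.2 -1.1
Bob 2016-10-21 2.2 -1.1
Bob 2016-10-22 0.2 -3.6
Bob 2016-10-24 -9.2 -14.1
Bob 2016-10-25 -7.2 -12.1
Alice 2016-10-20 7.2 6.1
Alice 2016-10-21 2.2 0.1
Alice 2016-10-23 13.2 8.1
Alice 2016-10-25 12.6 8.8
Alice 2016-10-27 7.7 4.7
Alice 2016-10-28 8.2 5.0
")

res <- df %>%
  mutate(Date = as.Date(Date)) %>%               # parse to Date class
  group_by(Name) %>%
  slice_max(Date, n = 3, with_ties = FALSE) %>%  # 3 most recent rows per person
  summarise(DaysBetween = as.integer(max(Date) - min(Date)),
            avgX = mean(X),
            avgY = mean(Y))
print(res)
```

Wrapping the difference in as.integer() returns DaysBetween as a plain number of days instead of a difftime column, which is closer to the output the question asks for.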