Manipulating data in R using data.table or dplyr with group-by and date subsetting

Time: 2016-11-20 02:39:38

Tags: r data.table dplyr

I'm looking for help with data.table and/or dplyr. I have a data frame like this:

Name     Date          X      Y
Mike     2016-10-21    3.2    1.6
Mike     2016-10-23    3.1    1.4
Mike     2016-10-24    4.9    3.8
Mike     2016-10-25    5.7    4.2
Mike     2016-10-28    0.2   -1.1
Bob      2016-10-21    2.2   -1.1
Bob      2016-10-22    0.2   -3.6
Bob      2016-10-24   -9.2  -14.1
Bob      2016-10-25   -7.2  -12.1
Alice    2016-10-20    7.2    6.1
Alice    2016-10-21    2.2    0.1
Alice    2016-10-23   13.2    8.1
Alice    2016-10-25   12.6    8.8
Alice    2016-10-27    7.7    4.7
Alice    2016-10-28    8.2    5.0

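I'd like to return the averages of X and Y. For each person, however, I want to subset the data so that only the values from that person's 3 most recent dates are used, ignoring data from older dates. I'd also like to return the number of days spanned by those 3 most recent dates, that is, the newest date minus the oldest of the three. Ideally, I'd end up with a data frame like this: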

Name     DaysBetween   avgX    avgY
Mike               4    3.6     2.3
Bob                3   -5.4    -9.9
Alice              3    9.5     6.2

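Edit note: this data will always be sorted by date, so we can also simply take the "last 3" data points for each person, rather than using date logic to work out which three are the most recent.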

Thanks in advance for your help!

3 answers:

Answer 0 (score: 2)

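We can use data.table. The rows are ordered by descending Date, so within each Name the first three elements are the three most recent observations (this assumes Date is already of class Date):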

library(data.table)
setDT(df1)[order(-Date), .(DaysBetween = as.integer(Date[1L] - Date[3L]), 
         avgX = mean(X[1:3]), avgY = round(mean(Y[1:3]),2)), by  = Name]
#    Name DaysBetween avgX  avgY
#1:  Mike           4  3.6  2.30
#2: Alice           3  9.5  6.17
#3:   Bob           3 -5.4 -9.93
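
As a variant that leans on the question's edit note (rows are already sorted by date within each person), a minimal sketch could take the last three rows per group with tail(.SD, 3) and then summarise them. The table name dt is chosen here for illustration, and the sample data is re-entered by hand from the question.

```r
library(data.table)

# Sample data from the question; `dt` is a name chosen for this sketch.
dt <- data.table(
  Name = rep(c("Mike", "Bob", "Alice"), times = c(5, 4, 6)),
  Date = as.Date(c(
    "2016-10-21", "2016-10-23", "2016-10-24", "2016-10-25", "2016-10-28",
    "2016-10-21", "2016-10-22", "2016-10-24", "2016-10-25",
    "2016-10-20", "2016-10-21", "2016-10-23", "2016-10-25", "2016-10-27",
    "2016-10-28")),
  X = c(3.2, 3.1, 4.9, 5.7, 0.2,
        2.2, 0.2, -9.2, -7.2,
        7.2, 2.2, 13.2, 12.6, 7.7, 8.2),
  Y = c(1.6, 1.4, 3.8, 4.2, -1.1,
        -1.1, -3.6, -14.1, -12.1,
        6.1, 0.1, 8.1, 8.8, 4.7, 5.0)
)

# Step 1: keep the last three rows per person (assumes rows are sorted by date).
# Step 2: summarise those rows per person.
res <- dt[, tail(.SD, 3), by = Name][
  , .(DaysBetween = as.integer(max(Date) - min(Date)),
      avgX = mean(X),
      avgY = mean(Y)),
  by = Name]

print(res)
```

Because tail() returns whatever rows exist, a person with fewer than three observations is summarised over those rows instead of producing NA values. Groups come back in order of first appearance (Mike, Bob, Alice).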

Answer 1 (score: 1)

The answers above are all good; here is an iterative approach:

#initialize the output frame with one row per unique Name
outputFrame = as.data.frame(matrix(nrow = length(unique(train$Name)),
                                   ncol = 4))

#name the columns of the output frame
names(outputFrame) = c("Names", "daysBetween", "avgX", "avgY")

#convert Date to Date class; the format matches the sample data (e.g. 2016-10-21)
train$Date = as.Date(train$Date, "%Y-%m-%d")

#initialize the outputCounter
outputCounter = 1

#iterates over every unique Name in the data frame
for(name in as.character(unique(train$Name)))
{
    #subsets the dataframe into the values of each given level of Name
    dfSubset = train[which(train$Name == name),]

    #Orders the dataframe by date
    dfSubset = dfSubset[order(dfSubset$Date),]

    #get the 3 most recent dates
    dfSubset = dfSubset[(nrow(dfSubset) -2):nrow(dfSubset),]

    #fill the names
    outputFrame$Names[outputCounter] = name

    #fill the days between
    outputFrame$daysBetween[outputCounter] = as.numeric(max(dfSubset$Date) - min(dfSubset$Date))

    #get the average X
    outputFrame$avgX[outputCounter] = mean(dfSubset$X)

    #get the average Y
    outputFrame$avgY[outputCounter] = mean(dfSubset$Y)

    #increment outputCounter
    outputCounter = outputCounter +1
}

This assumes train is your data frame.
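
The same per-person steps can also be written without a manual counter. The following is only a sketch in base R: summarise_last3 is a hypothetical helper name, and the sample data is read in from the question's table.

```r
# Sample data from the question, read from inline text.
train <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name   Date         X      Y
Mike   2016-10-21   3.2    1.6
Mike   2016-10-23   3.1    1.4
Mike   2016-10-24   4.9    3.8
Mike   2016-10-25   5.7    4.2
Mike   2016-10-28   0.2   -1.1
Bob    2016-10-21   2.2   -1.1
Bob    2016-10-22   0.2   -3.6
Bob    2016-10-24  -9.2  -14.1
Bob    2016-10-25  -7.2  -12.1
Alice  2016-10-20   7.2    6.1
Alice  2016-10-21   2.2    0.1
Alice  2016-10-23  13.2    8.1
Alice  2016-10-25  12.6    8.8
Alice  2016-10-27   7.7    4.7
Alice  2016-10-28   8.2    5.0
")
train$Date <- as.Date(train$Date)

# Hypothetical helper: summarise the three most recent rows of one person's data.
summarise_last3 <- function(d) {
  d <- tail(d[order(d$Date), ], 3)   # sort by date, keep the last three rows
  data.frame(Names = d$Name[1],
             daysBetween = as.numeric(max(d$Date) - min(d$Date)),
             avgX = mean(d$X),
             avgY = mean(d$Y))
}

# Split by person, summarise each piece, and bind the pieces back together.
outputFrame <- do.call(rbind, lapply(split(train, train$Name), summarise_last3))
rownames(outputFrame) <- NULL

print(outputFrame)
```

split() orders the groups by the sorted unique names, so the rows come out as Alice, Bob, Mike rather than in order of appearance.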

Answer 2 (score: 0)

You can use dplyr::top_n to filter the data:

library(dplyr)

df %>% mutate(Date = as.Date(Date)) %>%    # parse to Date class, if not already
    group_by(Name) %>% 
    top_n(3, Date) %>%    # filter to max 3 dates for each group
    summarise(DaysBetween = max(Date) - min(Date), 
              avgX = mean(X), 
              avgY = mean(Y))

## # A tibble: 3 × 4
##     Name DaysBetween  avgX      avgY
##   <fctr>      <time> <dbl>     <dbl>
## 1  Alice      3 days   9.5  6.166667
## 2    Bob      3 days  -5.4 -9.933333
## 3   Mike      4 days   3.6  2.300000
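
In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). A sketch of the same pipeline with the newer verb, assuming such a dplyr version is installed and with the sample data read in from the question, might look like this:

```r
library(dplyr)

# Sample data from the question, read from inline text.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name   Date         X      Y
Mike   2016-10-21   3.2    1.6
Mike   2016-10-23   3.1    1.4
Mike   2016-10-24   4.9    3.8
Mike   2016-10-25   5.7    4.2
Mike   2016-10-28   0.2   -1.1
Bob    2016-10-21   2.2   -1.1
Bob    2016-10-22   0.2   -3.6
Bob    2016-10-24  -9.2  -14.1
Bob    2016-10-25  -7.2  -12.1
Alice  2016-10-20   7.2    6.1
Alice  2016-10-21   2.2    0.1
Alice  2016-10-23  13.2    8.1
Alice  2016-10-25  12.6    8.8
Alice  2016-10-27   7.7    4.7
Alice  2016-10-28   8.2    5.0
")

res <- df %>%
  mutate(Date = as.Date(Date)) %>%                   # parse to Date class
  group_by(Name) %>%
  slice_max(Date, n = 3, with_ties = FALSE) %>%      # three latest dates per person
  summarise(DaysBetween = as.integer(max(Date) - min(Date)),
            avgX = mean(X),
            avgY = mean(Y))

print(res)
```

Setting with_ties = FALSE caps each group at exactly three rows even if a person has duplicate dates, and as.integer() turns the difftime into a plain day count.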