Question

I have a set of data in an R data.frame recording amounts used by users, each identified by a unique ID.

ID        start date         end date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
 ....
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000

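The time series for each ID starts and ends at inconsistent times and contains an inconsistent number of observations. However, they are all formatted in the manner shown above; the start and end dates are Date objects.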

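I would like to standardize the breakdown for each ID into a monthly time series, with a data point at the start of each month, weighting the amounts of observations that happen to span two or more months. In other words, I would like to turn this series into something like: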

ID        start date         end date        amount
1         1-1-2012          2-1-2012       3096 = 6000 * 16/31
1         2-1-2012          3-1-2012       4339 = 6000*15/31+4000*14/39
1         3-1-2012          4-1-2012       etc
 ....
1         6-1-2012          7-1-2012       etc
2         1-1-2012          2-1-2012       etc
2         2-1-2012          3-1-2012       etc
2         3-1-2012          4-1-2012       etc
2         4-1-2012          5-1-2012       etc
2         5-1-2012          6-1-2012       etc
 ....
10000     1-1-2012          2-1-2012       etc
 ....
10000     6-1-2012          7-1-2012       etc

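The value for ID 1 between 2/1/12 and 3/1/12 is computed as follows: take the share of the 1-15-2012 to 2-15-2012 observation that falls in February (15 days / 31 days) times the amount in that observation (6000), and add the share of the 2-15 to 3-25 observation that falls in February (14 days / 39 days; 2012 is a leap year) times the amount in that observation (4000). This yields 6000 * 15/31 + 4000 * 14/39 = 4339. The same should be done for each ID's time series. We do not need to consider cases where an observation period fits entirely within one month; but if a period spans more than two months, it should be split across those months with the appropriate weights.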
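As a quick check of the arithmetic described above, here is a minimal sketch in R. The day counts (16/31, 15/31, 14/39) are taken directly from the question as stated, not derived from the dates, and the variable names are only illustrative.

```r
# Day counts as stated in the question (assumed here, not derived from the dates)
jan_share <- 6000 * 16 / 31                    # January part of the first observation
feb_share <- 6000 * 15 / 31 + 4000 * 14 / 39   # rest of observation 1 + start of observation 2

round(c(january = jan_share, february = feb_share), 3)
# january: 3096.774, february: 4339.123
```

These values correspond to the 3096 and 4339 shown in the desired output table.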

I'm pretty new to R and could certainly use some help with this!

Answer 1

Here is a solution using base R:

#The data
df=read.table(text='ID        start_date         end_date        amount
1         1-15-2012          2-15-2012       6000
1         2-15-2012          3-25-2012       4000
1         3-25-2012          5-26-2012       3000
1         5-26-2012          6-13-2012       1000
2         1-16-2012          2-27-2012       7000
2         2-27-2012          3-18-2012       2000
2         3-18-2012          5-23-2012       3000
10000     1-12-2012          2-24-2012       12000
10000     2-24-2012          3-11-2012       22000
10000     3-11-2012          5-27-2012       33000
10000     5-27-2012          6-10-2012       5000',
              header=T,row.names = NULL,stringsAsFactors =FALSE)

df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")

df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot

#using by compute new amount
df1$amount=do.call(c,
  by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
        )
df1

> df1
      ID     startm       endm    amount
1      1 2012-01-01 2012-02-01  3096.774
2      1 2012-02-01 2012-03-01  4339.123
3      1 2012-03-01 2012-05-01  4306.038
4      1 2012-05-01 2012-06-01  1535.842
5      2 2012-01-01 2012-02-01  2500.000
6      2 2012-02-01 2012-03-01  4700.000
7      2 2012-03-01 2012-05-01  3754.545
8  10000 2012-01-01 2012-02-01  5302.326
9  10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000
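The month-start columns in the code above come from calling cut() on Date values with "month" breaks. The following is a small standalone sketch of that step on made-up dates; the second line shows an alternative based on format(), which is an assumption of this sketch and not part of the answer's code.

```r
d <- as.Date(c("2012-01-15", "2012-02-15", "2012-03-25"))

# cut() labels each date with the first day of its month; convert the labels back to Date
month_start_cut <- as.Date(as.character(cut(d, "month")))

# Alternative (not used in the answer): rebuild the date with the day fixed to 01
month_start_fmt <- as.Date(format(d, "%Y-%m-01"))

month_start_cut
month_start_fmt
```

Both approaches should give the same month starts; the format() version does not depend on the factor levels that cut() creates.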

Answer 2

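To solve this, I think the easiest approach is to break it down into two problems.

1. How do I get a daily breakdown of the figures I'm interested in? This is an assumption I've made based on the information you've provided.
2. How do I group by a date range and summarize what I'm interested in?

For the following example, I'll use a dataset that I created with the following code: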

df <- data.frame(
  id=c(1,1,1,1,2,2,2),
  start_date=as.Date(c("1-15-2012",
                       "2-15-2012",
                       "3-25-2012",
                       "5-26-2012",
                       "1-16-2012",
                       "2-27-2012",
                       "3-18-2012"), "%m-%d-%Y"),
  end_date=as.Date(c("2-15-2012",
                     "3-25-2012",
                     "5-26-2012",
                     "6-13-2012",
                     "2-27-2012",
                     "3-18-2012",
                     "5-23-2012"), "%m-%d-%Y"),
  amount=c(6000,
           4000,
           3000,
           1000,
           7000,
           2000,
           3000)  
  )

1. Provide daily data

To provide daily data, we first compute the daily contribution of each observation:

df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)

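Then we expand each observation into one row per day between its start and end dates. There are a couple of ways you can do this, but since you are working with dplyr, here is the dplyr way: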

library(dplyr)
df <- df %>%
  rowwise() %>%
  do(data.frame(id=.$id, 
                date=as.Date(seq(from=.$start_date, to=(.$end_date), by="day")), 
                daily_contribution=.$daily_contribution))

This gives output that looks something like this:

Source: local data frame [285 x 3]
Groups: <by row>

   id       date daily_contribution
1   1 2012-01-15           193.5484
2   1 2012-01-16           193.5484
3   1 2012-01-17           193.5484
4   1 2012-01-18           193.5484
5   1 2012-01-19           193.5484
6   1 2012-01-20           193.5484
7   1 2012-01-21           193.5484
8   1 2012-01-22           193.5484
9   1 2012-01-23           193.5484
10  1 2012-01-24           193.5484
.. ..        ...                ...

2. Create grouping variables

Next, we create the grouping variables we're interested in. I've used lubridate to easily get the month and year of each date:

library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)

Now, with all of that in place, we can easily use dplyr to summarize our information by date as required.

# Store the summary so the next step can use the amount column
df <- df %>% 
  group_by(id, mnth, yr) %>%
  summarise(amount=sum(daily_contribution))
df

With output:

Source: local data frame [11 x 4]
Groups: id, mnth

   id mnth   yr    amount
1   1    1 2012 3290.3226
2   1    2 2012 4441.6873
3   1    3 2012 2902.8122
4   1    4 2012 1451.6129
5   1    5 2012 1591.3978
6   1    6 2012  722.2222
7   2    1 2012 2666.6667
8   2    2 2012 4800.0000
9   2    3 2012 2436.3636
10  2    4 2012 1363.6364
11  2    5 2012 1045.4545

To get it in exactly the format you specified:

df %>% rowwise() %>%
  mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
         end_date=as.Date(ISOdate(yr, mnth+1, 1))) %>%
  select(id, start_date, end_date, amount)

With output:

Source: local data frame [11 x 4]
Groups: <by row>

   id start_date   end_date    amount
1   1 2012-01-01 2012-02-01 3290.3226
2   1 2012-02-01 2012-03-01 4441.6873
3   1 2012-03-01 2012-04-01 2902.8122
4   1 2012-04-01 2012-05-01 1451.6129
5   1 2012-05-01 2012-06-01 1591.3978
6   1 2012-06-01 2012-07-01  722.2222
7   2 2012-01-01 2012-02-01 2666.6667
8   2 2012-02-01 2012-03-01 4800.0000
9   2 2012-03-01 2012-04-01 2436.3636
10  2 2012-04-01 2012-05-01 1363.6364
11  2 2012-05-01 2012-06-01 1045.4545

As required.
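One caveat with ISOdate(yr, mnth + 1, 1): for December it asks for month 13, which yields NA instead of January of the following year. The example data ends in June, so it is not affected. A minimal sketch of one way around this, using seq() with by = "month"; the helper name next_month_start is hypothetical and not part of the answer.

```r
# Hypothetical helper: first day of the month after each date in d.
# seq() with by = "month" handles the December -> January rollover.
next_month_start <- function(d) {
  as.Date(
    sapply(seq_along(d), function(i) seq(d[i], by = "month", length.out = 2)[2]),
    origin = "1970-01-01"
  )
}

month_start <- as.Date(c("2012-11-01", "2012-12-01"))
next_month_start(month_start)
```

The same helper could replace the end_date computation in the mutate() call if the data ever spans a year boundary.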

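Note: I can see from your example that you have 3096 = 6000 * 16/31 and 4339 = 6000*15/31+4000*14/39, but for the first one, for example, you have 15 January to 31 January, which is 17 days if the date range is inclusive. You can easily change this if required.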
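To make the day-counting point in the note concrete, here is a base R sketch that reproduces the question's figures. It assumes each observation covers the days after its start date up to and including its end date, which is one convention consistent with the question's 16/31, 15/31 and 14/39 weights. The names obs, daily and monthly are illustrative only.

```r
obs <- data.frame(
  id         = c(1, 1),
  start_date = as.Date(c("2012-01-15", "2012-02-15")),
  end_date   = as.Date(c("2012-02-15", "2012-03-25")),
  amount     = c(6000, 4000)
)

# Expand each observation into its days, excluding the start date and
# including the end date (assumed convention), with an equal amount per day
daily <- do.call(rbind, lapply(seq_len(nrow(obs)), function(i) {
  days <- seq(obs$start_date[i] + 1, obs$end_date[i], by = "day")
  data.frame(id = obs$id[i],
             month_start = as.Date(format(days, "%Y-%m-01")),
             daily_amount = obs$amount[i] / length(days))
}))

# Sum the daily amounts per ID and month
monthly <- aggregate(daily["daily_amount"],
                     by = list(id = daily$id, month_start = daily$month_start),
                     FUN = sum)
names(monthly)[3] <- "amount"
monthly <- monthly[order(monthly$id, monthly$month_start), ]
monthly
```

With this convention, January and February for ID 1 come out as 6000 * 16/31 and 6000 * 15/31 + 4000 * 14/39, as in the question. Including both endpoints instead, as the dplyr code above does, counts each boundary day twice.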

Answer 3

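Here is a solution using plyr and reshape. The numbers differ from the ones you provided, so I may have misunderstood your intent, although this seems to match your stated goal (a weighted average by month).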

df$index <- 1:nrow(df) #Create a unique index number

#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")

library(plyr); library(reshape)  #Load the libraries

#dlply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
#    ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) { 
  ID <- X$ID        #Keep the ID value
  n.days <- as.numeric(difftime( X$end.date, X$start.date ))  #Calculate time difference in days, report the result as a number
  day <- seq(X$start.date, X$end.date, by="days")   #Sequence of days
  amount.per.day <- X$amount/n.days      #Amount for that day
  data.frame(ID, day, amount.per.day)    #Last line is the output
})

#Change list back into data.frame
df3 <- ldply(df2, data.frame)   #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <-  as.numeric(format(df3$day, "%m"))   #Assign a month to all dates

#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))

#       ID mon    amount
#    1   1   1 3290.3226
#    2   1   2 4441.6873
#    3   1   3 2902.8122
#    4   1   4 1451.6129
#    5   1   5 1591.3978
#    6   1   6  722.2222
#    7   2   1 2666.6667
#    8   2   2 4800.0000
#    9   2   3 2436.3636
#    10  2   4 1363.6364
#    11  2   5 1045.4545

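By the way, for future posts, you will get faster answers if you provide code that reproduces your data. If your data is somewhat complicated to recreate, you can use dput(yourdata). HTH!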
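As a brief illustration of the dput() suggestion, the sketch below uses a tiny made-up data frame (small is a hypothetical name). dput() prints R code that recreates the object, which can be pasted straight into a question.

```r
# A tiny made-up data frame standing in for your real data
small <- data.frame(ID = c(1, 1), amount = c(6000, 4000))

# Prints an R expression that rebuilds `small` when evaluated
dput(small)
```

Readers can then assign the printed expression to a variable and work with exactly the same data.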

R programming - Split up a group of time series indexed by ID with irregular observation periods into regular monthly observations
