Question

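I have a data frame df that contains data from a car sales company. The data frame contains the date and the number of sales (sells) on that date. Every salesperson has a staff_id. The dummy variable initial_sell indicates which day was that person's first working day.

Now I want to add a column months_since_start that gives, for every day, the number of the month since the person started working. I can then use the sells and months_since_start columns to plot the average sales for each month since the salespeople started working (what each salesperson sells in month 1, in month 2, ...). Because some dates and months are missing (for example during holidays, as shown at the bottom of the example), I cannot simply add a sequence to get months_since_start.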

date        year    month   staff_id   sells  initial_sell   months_since_start
2014-11-11  2014    11      1          3      1              1
2014-11-12  2014    11      1          1      0              1
2014-11-14  2014    11      1          1      0              1
2014-11-15  2014    11      1          2      0              1
...                     
2014-12-10  2014    12      1          2      0              1
2014-12-11  2014    12      1          1      0              2
...                     
2014-12-23  2014    12      2          1      1              1
2015-02-02  2015    2       2          4      0              2
2015-02-03  2015    2       2          1      0              2
...                     
2015-03-23  2015    3       2          3      0              4
...

Can someone help me get the months_since_start column?

Answer 1

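Assume the input is sorted by staff_id and date, as in the question and as shown reproducibly in the Note at the end. Define a Months function that, given a sorted vector of dates for one staff member, returns the months since start (that is, since the first date) for that member. Then apply it to each staff member using tapply. tapply returns a list sorted by staff_id, so flatten it with unlist. No packages are used.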

Months <- function(date) {
  with(as.POSIXlt(date), 12 * (year - year[1]) + (mon - mon[1]) + (mday >= mday[1]))
}

transform(DF, months_since_start = unlist(tapply(date, staff_id, FUN = Months)))

giving:

         date year month staff_id sells initial_sell months_since_start
1  2014-11-11 2014    11        1     3            1                  1
2  2014-11-12 2014    11        1     1            0                  1
3  2014-11-14 2014    11        1     1            0                  1
4  2014-11-15 2014    11        1     2            0                  1
5  2014-12-10 2014    12        1     2            0                  1
6  2014-12-11 2014    12        1     1            0                  2
7  2014-12-23 2014    12        2     1            1                  1
8  2015-02-02 2015     2        2     4            0                  2
9  2015-02-03 2015     2        2     1            0                  2
10 2015-03-23 2015     3        2     3            0                  4
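
To see why the formula inside Months produces these numbers, here is a minimal sketch that evaluates its terms step by step for the dates of staff_id 2. The names d, lt, calendar_months and reached_start_day are illustrative only and do not appear in the answer.

```r
# Dates for staff_id 2 from the question, already sorted
d <- as.Date(c("2014-12-23", "2015-02-02", "2015-02-03", "2015-03-23"))
lt <- as.POSIXlt(d)

# Calendar months between each date and the first date
# (in POSIXlt, year counts from 1900 and mon runs from 0 to 11)
calendar_months <- 12 * (lt$year - lt$year[1]) + (lt$mon - lt$mon[1])

# TRUE (counted as 1) once the day of the month has reached the starting day
reached_start_day <- lt$mday >= lt$mday[1]

months_since_start <- calendar_months + reached_start_day
months_since_start
```

The last term is what makes 2014-12-23 count as month 1 while 2015-02-02, which falls before the 23rd of its month, is still in month 2 rather than month 3.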

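An alternative that uses ave instead of tapply follows. Months is as above. MonthsDF calls Months but accepts row numbers rather than the dates themselves. This solution still assumes that the data is sorted by date within each staff_id, but because ave returns its output in the same order as its input, the data need not be sorted by staff_id. A disadvantage of ave is that it cannot handle "Date" class data in the manner required here, which is why we use row numbers as the input to MonthsDF: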

MonthsDF <- function(ix) Months(DF$date[ix])
transform(DF, months_since_start = ave(seq_along(date), staff_id, FUN = MonthsDF))
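
As a small check of the ordering claim, the sketch below applies the ave variant to rows in which the two staff members are interleaved but each member's own dates are still in order. The four rows are an illustrative subset of the question's data, not the full input.

```r
Months <- function(date) {
  with(as.POSIXlt(date), 12 * (year - year[1]) + (mon - mon[1]) + (mday >= mday[1]))
}

# Rows alternate between staff 1 and staff 2; dates are sorted within each staff_id
DF <- data.frame(
  date = as.Date(c("2014-11-11", "2014-12-23", "2014-12-11", "2015-02-02")),
  staff_id = c(1, 2, 1, 2)
)

# ave passes row numbers per group; MonthsDF looks up the dates for those rows
MonthsDF <- function(ix) Months(DF$date[ix])
DF$months_since_start <- ave(seq_along(DF$date), DF$staff_id, FUN = MonthsDF)
DF
```

Each result stays on the row it belongs to, so no reordering by staff_id is needed before or after the call.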

Note: The following input was used:

Lines <- "date        year    month   staff_id   sells  initial_sell   
2014-11-11  2014    11      1          3      1              
2014-11-12  2014    11      1          1      0              
2014-11-14  2014    11      1          1      0              
2014-11-15  2014    11      1          2      0                            
2014-12-10  2014    12      1          2      0              
2014-12-11  2014    12      1          1      0              
2014-12-23  2014    12      2          1      1              
2015-02-02  2015    2       2          4      0              
2015-02-03  2015    2       2          1      0              
2015-03-23  2015    3       2          3      0"

DF <- read.table(text = Lines, header = TRUE)
DF$date <- as.Date(DF$date)

# in the question the input is already sorted by staff_id and date so
# the next two lines are not really needed but if we had non-sorted data
# then we should first sort it like this to be in the same form as in question
o <- with(DF, order(staff_id, date))
DF <- DF[o, ]
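
Once months_since_start exists, the average the question ultimately asks for can be sketched with aggregate. This assumes that "average sales" means the mean of the daily sells values pooled over all staff; the data frame name sales and the hard-coded column values (copied from the output above) are for illustration only.

```r
# sells and months_since_start for the ten sample rows shown in the answer
sales <- data.frame(
  staff_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  sells = c(3, 1, 1, 2, 2, 1, 1, 4, 1, 3),
  months_since_start = c(1, 1, 1, 1, 1, 2, 1, 2, 2, 4)
)

# Mean of daily sells for each month since start, across all staff
avg_sells <- aggregate(sells ~ months_since_start, data = sales, FUN = mean)
avg_sells
```

The resulting two-column data frame can be passed directly to plot, for example plot(avg_sells, type = "b").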

Answer 2

An approach using dplyr and lubridate:

library(dplyr)
library(lubridate)
# some sample data
df <- data.frame(date = rep(seq(as.Date('2014-01-01'), as.Date('2014-04-04'), by = 30), 3),
                 staff_id = rep(1:3, each = 4))

So df looks like:

> head(df)
        date staff_id
1 2014-01-01        1
2 2014-01-31        1
3 2014-03-02        1
4 2014-04-01        1
5 2014-01-01        2
6 2014-01-31        2

Now use dplyr to group_by staff_id and then mutate to add the column. Within mutate, months_since_start is set to the time_length of the interval between the minimum date (taken per staff_id, thanks to group_by) and the date of each row. Set the unit of time_length to 'month'.

df %>%
  group_by(staff_id) %>%
  mutate(months_since_start = time_length(interval(min(date), date), unit = 'month'))

You get:

Source: local data frame [12 x 3]
Groups: staff_id [3]

         date staff_id months_since_start
       (date)    (int)              (dbl)
1  2014-01-01        1          0.0000000
2  2014-01-31        1          0.9677419
3  2014-03-02        1          2.0322581
4  2014-04-01        1          3.0000000
5  2014-01-01        2          0.0000000
6  2014-01-31        2          0.9677419
7  2014-03-02        2          2.0322581
8  2014-04-01        2          3.0000000
9  2014-01-01        3          0.0000000
10 2014-01-31        3          0.9677419
11 2014-03-02        3          2.0322581
12 2014-04-01        3          3.0000000

If you want completed months rather than fractional ones, round the result down, for example by wrapping the time_length(...) call in floor().
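
The question numbers months from 1 rather than 0, so one way to reproduce its months_since_start column with this approach is to take completed months and add 1. The following is a sketch under that assumption; the data frame name sales is illustrative, and its rows are a subset of the question's data.

```r
library(dplyr)
library(lubridate)

# A few rows for staff_id 1 and 2 taken from the question
sales <- data.frame(
  date = as.Date(c("2014-11-11", "2014-12-10", "2014-12-11",
                   "2014-12-23", "2015-02-02", "2015-03-23")),
  staff_id = c(1, 1, 1, 2, 2, 2)
)

result <- sales %>%
  group_by(staff_id) %>%
  mutate(
    # completed months since each member's first date, shifted so the first month is 1
    months_since_start = floor(time_length(interval(min(date), date), unit = "month")) + 1
  ) %>%
  ungroup()

result
```

Because min(date) is evaluated per group, this version does not depend on the rows being sorted.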

Calculating months since start in a data frame
