How to count the number of orders a customer placed before each date

Time: 2017-08-17 10:45:41

Tags: r dataframe aggregate

I have two datasets, a smaller one:

OrderDate id no_of_orders_before_row_date 
01-Jul-17 1 0 
02-Jul-17 1 1 
02-Jul-17 2 0 
03-Jul-17 3 0 
01-Jul-17 4 0 
03-Jul-17 4 1 
05-Jul-17 5 0 
07-Jul-17 6 0 
09-Jul-17 2 1 
11-Jul-17 1 2 
13-Jul-17 4 2 
15-Jul-17 3 1

and a larger version that can be downloaded from:

https://docs.google.com/spreadsheets/d/1buF74VKwOj1-f_4hDPnP17vWqoUupMRnNz301laCLJM/edit#gid=0

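Note that the larger dataset is unsorted and contains multiple orders on the same day.

For each row, I am looking for the number of orders placed before that row's date.

The Excel formula used is =COUNTIFS($L:$L,L5,$K:$K,"<"&K5)

where column L is OrderDate and column K is id.

How can I do this in R?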

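2 answers:

Answer 0: (score: 1)

The OP has asked to count, for each id, the orders placed before the date of the current row.

If there is at most one order per day and id, this is equivalent to sorting the data.frame by OrderDate and numbering all rows that belong to a particular id consecutively, starting at 0. Unfortunately, this only works for the small sample dataset provided in the question, not for the larger dataset that can be downloaded from the given link.

The larger dataset contains ties, i.e., there are several cases where a customer placed more than one order on the same day. Here the simple approach fails because it also counts rows that fall on the same day. This can be fixed using rank().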
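To illustrate the difference on ties, here is a minimal base R sketch. The `dates` vector is made up for illustration (one hypothetical customer with two same-day pairs) and is not taken from either dataset.

```r
# Order dates for one hypothetical customer, including same-day ties
dates <- as.Date(c("2017-07-09", "2017-07-10", "2017-07-10",
                   "2017-07-11", "2017-07-11", "2017-07-12"))

# Simple running count: an earlier row on the same day is counted as "previous"
naive <- seq_along(dates) - 1L

# rank() with ties.method = "min" gives tied dates the same (lowest) rank,
# so rank - 1 counts only the orders on strictly earlier dates
by_rank <- rank(dates, ties.method = "min") - 1L

data.frame(dates, naive, by_rank)
```

The two columns should agree wherever a date is unique or is the first of its tie group, and differ on the later rows of a tie.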

The simple solution works for the small dataset

library(data.table)
# coerce to data.table
setDT(DT1)[
  # convert character date to class Date to ensure correct sort order
  , OrderDate := lubridate::dmy(OrderDate)][
    # order by date, create new column with the row numbers for each id
    order(OrderDate), previous_orders := (1:.N) - 1L, by = id][]
     OrderDate id no_of_orders_before_row_date previous_orders
 1: 2017-07-01  1                            0               0
 2: 2017-07-02  1                            1               1
 3: 2017-07-02  2                            0               0
 4: 2017-07-03  3                            0               0
 5: 2017-07-01  4                            0               0
 6: 2017-07-03  4                            1               1
 7: 2017-07-05  5                            0               0
 8: 2017-07-07  6                            0               0
 9: 2017-07-09  2                            1               1
10: 2017-07-11  1                            2               2
11: 2017-07-13  4                            2               2
12: 2017-07-15  3                            1               1

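The simple solution fails for the larger dataset

Note that the code below has been adapted to the different column names, and that the result is sorted to make the failure easier to see.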

setDT(DT2)[, Order.Date := lubridate::dmy(Order.Date)][
  order(Order.Date), previous_orders := (1:.N) - 1L, by = Phone.Number][
    order(Phone.Number, Order.Date)]
    Order.Date Phone.Number Count previous_orders
 1: 2017-07-09   7353478602     0               0
 2: 2017-07-10   7353478602     1               1
 3: 2017-07-11   7353478602     2               2
 4: 2017-07-09   8123246689     0               0
 5: 2017-07-10   8123246689     1               1
 6: 2017-07-10   8123246689     1               2
 7: 2017-07-11   8123246689     3               3
 8: 2017-07-11   8123246689     3               4
 9: 2017-07-12   8123246689     5               5
10: 2017-07-08   8867413567     0               0
11: 2017-07-09   9036580445     0               0
12: 2017-07-11   9164539082     0               0
13: 2017-07-09   9538991240     0               0
14: 2017-07-08   9675623760     0               0
15: 2017-07-12   9845798557     0               0
16: 2017-07-12   9886668467     0               0
17: 2017-07-10   9886728132     0               0
18: 2017-07-12   9902789900     0               0

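Note the differences in rows 6 and 8 (previous_orders is 2 and 4, whereas the expected Count is 1 and 3). In both cases, a purchase made on the same day has been included in the count.

Using rank() to handle multiple purchases on the same day

The modified code below, which uses frank(), data.table's fast implementation of rank(), returns the correct result: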

setDT(DT2)[, Order.Date := lubridate::dmy(Order.Date)][
  order(Order.Date), 
  previous_orders := frank(Order.Date, ties.method = "min") - 1L, 
  by = Phone.Number][
    order(Phone.Number, Order.Date)]
    Order.Date Phone.Number Count previous_orders
 1: 2017-07-09   7353478602     0               0
 2: 2017-07-10   7353478602     1               1
 3: 2017-07-11   7353478602     2               2
 4: 2017-07-09   8123246689     0               0
 5: 2017-07-10   8123246689     1               1
 6: 2017-07-10   8123246689     1               1
 7: 2017-07-11   8123246689     3               3
 8: 2017-07-11   8123246689     3               3
 9: 2017-07-12   8123246689     5               5
10: 2017-07-08   8867413567     0               0
11: 2017-07-09   9036580445     0               0
12: 2017-07-11   9164539082     0               0
13: 2017-07-09   9538991240     0               0
14: 2017-07-08   9675623760     0               0
15: 2017-07-12   9845798557     0               0
16: 2017-07-12   9886668467     0               0
17: 2017-07-10   9886728132     0               0
18: 2017-07-12   9902789900     0               0
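If data.table is not available, the same idea can probably be expressed in base R with ave() and rank(). This is only a sketch: the `orders` data frame and its values are invented for illustration, and the column names merely mirror those of the larger dataset.

```r
# Hypothetical unsorted orders; customer "A" has two orders on 2017-07-10
orders <- data.frame(
  Order.Date = as.Date(c("2017-07-10", "2017-07-09", "2017-07-10",
                         "2017-07-11", "2017-07-09")),
  Phone.Number = c("A", "A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Phone.Number group and returns the results
# in the original row order, so no prior sorting is needed
orders$previous_orders <- ave(
  as.numeric(orders$Order.Date),
  orders$Phone.Number,
  FUN = function(d) rank(d, ties.method = "min") - 1
)

orders
```

Because ave() returns a numeric vector, the counts come back as doubles rather than integers; wrap the result in as.integer() if that matters.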

If the date column is already of class Date, the call to lubridate::dmy() will create NAs and must be omitted, e.g.,

setDT(rawdata)[order(orderdate), previous_orders := frank(orderdate, ties.method = "min") - 1L, by = phone][order(phone, orderdate)]
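One way to avoid converting twice is to check the class before converting. The helper below is a base R sketch: `ensure_date` is a hypothetical name, and the default format assumes day/month/year strings such as "9/7/2017", as in the larger dataset shown in the Data section.

```r
# Hypothetical helper: convert to Date only if the input is not a Date already
ensure_date <- function(x, format = "%d/%m/%Y") {
  if (inherits(x, "Date")) {
    x                                # already a Date: return unchanged
  } else {
    as.Date(x, format = format)      # character input: parse with the given format
  }
}

raw <- c("9/7/2017", "10/7/2017")
once <- ensure_date(raw)
twice <- ensure_date(once)           # second call leaves the dates untouched
```

With a guard like this, the conversion step can be run repeatedly on the same column without turning valid dates into NA.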

Data

The small dataset as provided in the question:

DT1 <- structure(list(OrderDate = c("01-Jul-17", "02-Jul-17", "02-Jul-17", 
"03-Jul-17", "01-Jul-17", "03-Jul-17", "05-Jul-17", "07-Jul-17", 
"09-Jul-17", "11-Jul-17", "13-Jul-17", "15-Jul-17"), id = c(1L, 
1L, 2L, 3L, 4L, 4L, 5L, 6L, 2L, 1L, 4L, 3L), no_of_orders_before_row_date = c(0L, 
1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 2L, 2L, 1L)), .Names = c("OrderDate", 
"id", "no_of_orders_before_row_date"), row.names = c(NA, -12L
), class = "data.frame")

The larger dataset, downloaded from the given link as a csv file:

library(data.table)
DT2 <- fread("R doubts - Sheet1.csv", drop = 4L, skip = 1L, check.names = TRUE,
             colClasses = c("Phone Number" = "character"))

DT2 <- structure(list(Order.Date = c("8/7/2017", "9/7/2017", "10/7/2017", 
"11/7/2017", "12/7/2017", "9/7/2017", "10/7/2017", "11/7/2017", 
"12/7/2017", "9/7/2017", "10/7/2017", "11/7/2017", "12/7/2017", 
"9/7/2017", "10/7/2017", "11/7/2017", "12/7/2017", "8/7/2017"
), Phone.Number = c("9675623760", "9036580445", "7353478602", 
"7353478602", "9845798557", "7353478602", "8123246689", "9164539082", 
"9902789900", "9538991240", "9886728132", "8123246689", "8123246689", 
"8123246689", "8123246689", "8123246689", "9886668467", "8867413567"
), Count = c(0L, 0L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 3L, 
5L, 0L, 1L, 3L, 0L, 0L)), .Names = c("Order.Date", "Phone.Number", 
"Count"), row.names = c(NA, -18L), class = "data.frame")

Answer 1: (score: 0)

You can do this with...

df$OrderDate <- as.Date(df$OrderDate, format="%d-%b-%y") #your dates are type character
df$prevOrders <- sapply(1:nrow(df),function(i)
                   sum(df$OrderDate<df$OrderDate[i] & df$id==df$id[i]))

df
    OrderDate id prevOrders
1  2017-07-01  1          0
2  2017-07-02  1          1
3  2017-07-02  2          0
4  2017-07-03  3          0
5  2017-07-01  4          0
6  2017-07-03  4          1
7  2017-07-05  5          0
8  2017-07-07  6          0
9  2017-07-09  2          1
10 2017-07-11  1          2
11 2017-07-13  4          2
12 2017-07-15  3          1
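Because the comparison is a strict `<`, this approach should also cope with several orders on the same day. The sketch below checks that on invented data (the values are hypothetical, not taken from the linked spreadsheet).

```r
# Hypothetical data: id 1 has two orders on 2017-07-10
df <- data.frame(
  OrderDate = as.Date(c("2017-07-09", "2017-07-10", "2017-07-10",
                        "2017-07-11", "2017-07-10")),
  id = c(1L, 1L, 1L, 1L, 2L)
)

# For each row, count rows with the same id and a strictly earlier date
df$prevOrders <- sapply(seq_len(nrow(df)), function(i)
  sum(df$OrderDate < df$OrderDate[i] & df$id == df$id[i]))

df
```

Note that every row is compared against the whole data frame, so the running time grows roughly with the square of the number of rows; for large inputs a grouped rank is likely to be faster.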