I have two datasets, a smaller one:
OrderDate id no_of_orders_before_row_date
01-Jul-17 1 0
02-Jul-17 1 1
02-Jul-17 2 0
03-Jul-17 3 0
01-Jul-17 4 0
03-Jul-17 4 1
05-Jul-17 5 0
07-Jul-17 6 0
09-Jul-17 2 1
11-Jul-17 1 2
13-Jul-17 4 2
15-Jul-17 3 1
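and a larger version that can be downloaded from https://docs.google.com/spreadsheets/d/1buF74VKwOj1-f_4hDPnP17vWqoUupMRnNz301laCLJM/edit#gid=0
Note that the larger dataset is not sorted and contains multiple orders on the same day.
I am looking for the number of orders placed before the row date.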
The Excel formula used was =COUNTIFS($L:$L,L5,$K:$K,"<"&K5), where column L is OrderDate and column K is id.
How can I do this in R?
Answer 0: (score: 1)
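The OP has asked to count, for each id, the orders placed before the date of the current row.
If there is only one order per day and id, this is equivalent to sorting the data.frame by OrderDate and numbering all rows that belong to a specific id consecutively, starting at 0. Unfortunately, this only works for the small sample data set provided in the question, but not for the larger data set downloadable from the given link.
The larger data set contains ties, i.e., there are cases where one customer has placed more than one order on the same day. Here, the simple approach fails because it also counts rows from the same day. This can be fixed using rank().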
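To make the effect of ties concrete, here is a minimal base R sketch for a single customer. The dates vector is made up for illustration (one customer with two orders on each of two days), and base rank() is used here rather than any data.table function.

```r
# Order dates of one hypothetical customer, already sorted;
# two of the days carry two orders each
dates <- as.Date(c("2017-07-09", "2017-07-10", "2017-07-10",
                   "2017-07-11", "2017-07-11", "2017-07-12"))

# Simple approach: consecutive row numbers, starting at 0
simple <- seq_along(dates) - 1L

# Tie-aware approach: the minimum rank within a tie equals
# 1 + the number of strictly earlier dates
tie_aware <- rank(as.numeric(dates), ties.method = "min") - 1L

simple     # 0 1 2 3 4 5
tie_aware  # 0 1 1 3 3 5
```

The two results differ exactly on the rows that share a date with an earlier row, which is where the simple numbering overcounts.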
library(data.table)
# coerce to data.table
setDT(DT1)[
  # convert character date to class Date to ensure correct sort order
  , OrderDate := lubridate::dmy(OrderDate)][
    # order by date, create new column with the row numbers for each id
    order(OrderDate), previous_orders := (1:.N) - 1L, by = id][]
     OrderDate id no_of_orders_before_row_date previous_orders
 1: 2017-07-01  1                            0               0
 2: 2017-07-02  1                            1               1
 3: 2017-07-02  2                            0               0
 4: 2017-07-03  3                            0               0
 5: 2017-07-01  4                            0               0
 6: 2017-07-03  4                            1               1
 7: 2017-07-05  5                            0               0
 8: 2017-07-07  6                            0               0
 9: 2017-07-09  2                            1               1
10: 2017-07-11  1                            2               2
11: 2017-07-13  4                            2               2
12: 2017-07-15  3                            1               1
Note that the code below has been modified for the different column names, and the result is sorted to better show the failure.
setDT(DT2)[, Order.Date := lubridate::dmy(Order.Date)][
order(Order.Date), previous_orders := (1:.N) - 1L, by = Phone.Number][
order(Phone.Number, Order.Date)]
    Order.Date Phone.Number Count previous_orders
 1: 2017-07-09   7353478602     0               0
 2: 2017-07-10   7353478602     1               1
 3: 2017-07-11   7353478602     2               2
 4: 2017-07-09   8123246689     0               0
 5: 2017-07-10   8123246689     1               1
 6: 2017-07-10   8123246689     1               2
 7: 2017-07-11   8123246689     3               3
 8: 2017-07-11   8123246689     3               4
 9: 2017-07-12   8123246689     5               5
10: 2017-07-08   8867413567     0               0
11: 2017-07-09   9036580445     0               0
12: 2017-07-11   9164539082     0               0
13: 2017-07-09   9538991240     0               0
14: 2017-07-08   9675623760     0               0
15: 2017-07-12   9845798557     0               0
16: 2017-07-12   9886668467     0               0
17: 2017-07-10   9886728132     0               0
18: 2017-07-12   9902789900     0               0
Note the differences in rows 6 and 8. In both cases, purchases made on the same day have been included in the count.
rank() handles multiple purchases on the same day
The following modified code returns the correct result:
setDT(DT2)[, Order.Date := lubridate::dmy(Order.Date)][
order(Order.Date),
previous_orders := frank(Order.Date, ties.method = "min") - 1L,
by = Phone.Number][
order(Phone.Number, Order.Date)]
    Order.Date Phone.Number Count previous_orders
 1: 2017-07-09   7353478602     0               0
 2: 2017-07-10   7353478602     1               1
 3: 2017-07-11   7353478602     2               2
 4: 2017-07-09   8123246689     0               0
 5: 2017-07-10   8123246689     1               1
 6: 2017-07-10   8123246689     1               1
 7: 2017-07-11   8123246689     3               3
 8: 2017-07-11   8123246689     3               3
 9: 2017-07-12   8123246689     5               5
10: 2017-07-08   8867413567     0               0
11: 2017-07-09   9036580445     0               0
12: 2017-07-11   9164539082     0               0
13: 2017-07-09   9538991240     0               0
14: 2017-07-08   9675623760     0               0
15: 2017-07-12   9845798557     0               0
16: 2017-07-12   9886668467     0               0
17: 2017-07-10   9886728132     0               0
18: 2017-07-12   9902789900     0               0
If the date column is already of class Date, the call to lubridate::dmy() will create NAs and must be omitted, e.g.,
setDT(rawdata)[order(orderdate), previous_orders := (1:.N) - 1L, by = phone][order(phone, orderdate)]
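One way to guard against converting twice is to parse only when the column is not yet of class Date. The helper below is a minimal base R sketch: the name to_date_once and the "%d/%m/%Y" format (matching strings such as "8/7/2017") are assumptions for illustration, and as.Date() stands in for lubridate::dmy().

```r
# Hypothetical helper: convert a day/month/year string vector to Date,
# but return a vector that is already of class Date unchanged
to_date_once <- function(x, format = "%d/%m/%Y") {
  if (inherits(x, "Date")) {
    return(x)
  }
  as.Date(x, format = format)
}

raw   <- c("8/7/2017", "9/7/2017", "10/7/2017")
once  <- to_date_once(raw)    # parsed from character
twice <- to_date_once(once)   # already Date, returned as is

identical(once, twice)  # TRUE
anyNA(twice)            # FALSE
```

With such a guard, the conversion step can be rerun on the same object without turning valid dates into NA.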
The small data set provided in the question:
DT1 <- structure(list(OrderDate = c("01-Jul-17", "02-Jul-17", "02-Jul-17",
"03-Jul-17", "01-Jul-17", "03-Jul-17", "05-Jul-17", "07-Jul-17",
"09-Jul-17", "11-Jul-17", "13-Jul-17", "15-Jul-17"), id = c(1L,
1L, 2L, 3L, 4L, 4L, 5L, 6L, 2L, 1L, 4L, 3L), no_of_orders_before_row_date = c(0L,
1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 2L, 2L, 1L)), .Names = c("OrderDate",
"id", "no_of_orders_before_row_date"), row.names = c(NA, -12L
), class = "data.frame")
The larger data set, downloaded from the given link as a csv file:
library(data.table)
DT2 <- fread("R doubts - Sheet1.csv", drop = 4L, skip = 1L, check.names = TRUE,
colClasses = c("Phone Number" = "character"))
or
DT2 <- structure(list(Order.Date = c("8/7/2017", "9/7/2017", "10/7/2017",
"11/7/2017", "12/7/2017", "9/7/2017", "10/7/2017", "11/7/2017",
"12/7/2017", "9/7/2017", "10/7/2017", "11/7/2017", "12/7/2017",
"9/7/2017", "10/7/2017", "11/7/2017", "12/7/2017", "8/7/2017"
), Phone.Number = c("9675623760", "9036580445", "7353478602",
"7353478602", "9845798557", "7353478602", "8123246689", "9164539082",
"9902789900", "9538991240", "9886728132", "8123246689", "8123246689",
"8123246689", "8123246689", "8123246689", "9886668467", "8867413567"
), Count = c(0L, 0L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 3L,
5L, 0L, 1L, 3L, 0L, 0L)), .Names = c("Order.Date", "Phone.Number",
"Count"), row.names = c(NA, -18L), class = "data.frame")
Answer 1: (score: 0)
You can do this with...
df$OrderDate <- as.Date(df$OrderDate, format="%d-%b-%y") # your dates are of type character
df$prevOrders <- sapply(1:nrow(df),function(i)
sum(df$OrderDate<df$OrderDate[i] & df$id==df$id[i]))
df
OrderDate id prevOrders
1 2017-07-01 1 0
2 2017-07-02 1 1
3 2017-07-02 2 0
4 2017-07-03 3 0
5 2017-07-01 4 0
6 2017-07-03 4 1
7 2017-07-05 5 0
8 2017-07-07 6 0
9 2017-07-09 2 1
10 2017-07-11 1 2
11 2017-07-13 4 2
12 2017-07-15 3 1
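The sapply() call compares every row with every other row, which may become slow on large data. As a possible alternative (not part of the answer above), the same count can be sketched in base R with ave() and rank(). The data frame orders below is a small made-up example containing a same-day tie.

```r
# Small made-up example: customer "a" orders twice on 2017-07-10
orders <- data.frame(
  OrderDate = as.Date(c("2017-07-09", "2017-07-10", "2017-07-10",
                        "2017-07-08", "2017-07-11")),
  id = c("a", "a", "a", "b", "a")
)

# Brute-force count of strictly earlier orders per id, as with sapply() above
brute <- sapply(seq_len(nrow(orders)), function(i)
  sum(orders$OrderDate < orders$OrderDate[i] & orders$id == orders$id[i]))

# Grouped rank: within each id, (minimum rank - 1) counts strictly earlier orders
orders$prevOrders <- ave(as.numeric(orders$OrderDate), orders$id,
                         FUN = function(d) rank(d, ties.method = "min") - 1)

all(orders$prevOrders == brute)  # TRUE
```

Both orders of customer "a" on 2017-07-10 get a count of 1, so same-day purchases are not counted against each other.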