Question

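I am working with a data table in R that contains quarterly information on products sold in US grocery stores. In particular, there is a date column, a store column, and a product column. For example, here is a (very small) subset of the data: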

Date           StoreID       ProductID
2000-03-31     10001         20001       
2000-03-31     10001         20002
2000-03-31     10002         20001
2000-06-30     10001         20001

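For each product in each store, I want to find out how many consecutive quarters the product has been sold in that store up to that date. For example, if we restrict attention to staplers sold in one particular store, we would have: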

Date           StoreID       ProductID
2000-03-31     10001         20001       
2000-06-30     10001         20001
2000-09-30     10001         20001
2000-12-31     10001         20001      
2001-06-30     10001         20001
2001-09-30     10001         20001
2001-12-31     10001         20001

Assuming this is all the data for that StoreID and ProductID combination, I would like to assign a new variable as:

Date           StoreID       ProductID     V
2000-03-31     10001         20001         1
2000-06-30     10001         20001         2
2000-09-30     10001         20001         3
2000-12-31     10001         20001         4
2001-06-30     10001         20001         1
2001-09-30     10001         20001         2
2001-12-31     10001         20001         3
2002-03-31     10001         20001         4
2002-06-30     10001         20001         5
2002-09-30     10001         20001         6
2002-12-31     10001         20001         7
2004-03-31     10001         20001         1
2004-06-30     10001         20001         2

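Note that the count rolls over after Q4 2000 because the product was not sold in Q1 2001. Likewise, it rolls over after Q4 2002 because the product was not sold in Q1 2003; the next time the product is sold, it is assigned 1.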

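The problem I am running into is that my actual data set is very large (about 10 million rows), so this needs to be done efficiently. The only techniques I can come up with are inefficient. Any suggestions would be greatly appreciated.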

Answer 1

You can use a custom function that computes the gaps between quarters.

# Load data.table
library(data.table)
# Set data as a data.table object
setDT(data)
# Set key as it might be big data
setkey(data, StoreID, ProductID)

consecutiveQuarters <- function(date, timeGap = 14) {
    # Calculate the difference between successive dates;
    # a gap of more than timeGap weeks starts a new run
    shifts <- cumsum(c(FALSE, abs(difftime(date[-length(date)], date[-1], units = "weeks")) > timeGap))
    # Generate vector from 1 to number of consecutive quarters
    ave(shifts, shifts, FUN = seq_along)
}

# Calculate consecutive quarters by StoreID and ProductID
data[, V := consecutiveQuarters(Date), .(StoreID, ProductID)]
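
As a possible alternative to the week-based threshold, the sketch below compares integer quarter indices directly, so a run breaks exactly when a calendar quarter is skipped. The helper name `quarter_run` is hypothetical and not part of the answer. It assumes a non-empty `Date` vector, sorted in ascending order, with at most one row per quarter for a single store/product pair.

```r
# Hypothetical helper (an assumption, not from the answer above).
# Assumes `date` is a non-empty Date vector, sorted ascending,
# holding one store/product pair with at most one row per quarter.
quarter_run <- function(date) {
  lt <- as.POSIXlt(date)
  # Integer quarter index: year * 4 + quarter number (0 to 3)
  q <- (lt$year + 1900L) * 4L + lt$mon %/% 3L
  # A new run starts whenever the index does not advance by exactly one
  run_id <- cumsum(c(TRUE, diff(q) != 1L))
  # Number the positions within each run
  ave(seq_along(q), run_id, FUN = seq_along)
}

dates <- as.Date(c("2000-03-31", "2000-06-30", "2000-09-30", "2000-12-31",
                   "2001-06-30", "2001-09-30", "2001-12-31"))
quarter_run(dates)
# 1 2 3 4 1 2 3
```

With data.table it could presumably be applied per group in the same way as above, for example `data[, V := quarter_run(Date), .(StoreID, ProductID)]`, provided the rows are ordered by date within each group.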

Answer 2

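Create a variable that is 1 if the product was sold in a given quarter and 0 otherwise. Sort the variable so that it starts at the present and goes back in time.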

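Compare the cumulative sum of that variable with a sequence of the same length. Once sales drop to zero, the cumulative sum no longer equals the sequence. Counting the positions where the cumulative sum equals the sequence gives the number of consecutive quarters with positive sales.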

data <- data.frame(
  quarter = c(1, 2, 3, 4, 1, 2, 3, 4),
  store = as.factor(c(1, 1, 1, 1, 1, 1, 1, 1)),
  product = as.factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  numsold = c(5, 6, 0, 1, 7, 3, 2, 14)
)


sortedData <- data[order(-data$quarter),]

storeValues <- c("1")
productValues <- c("1","2")

dataConsec <- data.frame(store = NULL, product = NULL, ConsecutiveSales = NULL)

for (storeValue in storeValues ){
  for(productValue in productValues){

    prodSoldinQuarter <- 
      as.numeric(sortedData[sortedData$store == storeValue &
                        sortedData$product == productValue,]$numsold > 0)

    dataConsec <- rbind(dataConsec,
                        data.frame(
                          store = storeValue,
                          product = productValue,
                          ConsecutiveSales = 
                            sum(as.numeric(cumsum(prodSoldinQuarter) == 
                                     seq(1,length(prodSoldinQuarter)) 
                                    ))
                          ))

  }
}
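
The cumulative-sum comparison can also be isolated in a small function and applied per group without explicit loops. This is only a sketch: the names `leading_streak` and `sales` are hypothetical, and the data frame repeats the toy data used above.

```r
# Hypothetical helper: length of the leading run of TRUE/1 values.
# cumsum() keeps pace with 1, 2, 3, ... only until the first 0 appears.
leading_streak <- function(sold) sum(cumsum(sold) == seq_along(sold))

leading_streak(c(1, 0, 1, 1))  # 1
leading_streak(c(1, 1, 1, 1))  # 4
leading_streak(c(0, 1, 1))     # 0

# Same toy data as in the answer, most recent quarter first
sales <- data.frame(
  quarter = c(1, 2, 3, 4, 1, 2, 3, 4),
  store   = as.factor(c(1, 1, 1, 1, 1, 1, 1, 1)),
  product = as.factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  numsold = c(5, 6, 0, 1, 7, 3, 2, 14)
)
sales <- sales[order(-sales$quarter), ]

# One row per store/product; the numsold column now holds the streak length
streaks <- aggregate(numsold ~ store + product, data = sales,
                     FUN = function(n) leading_streak(n > 0))
streaks
```

Note that this yields one number per store/product pair (the current streak, counted back from the latest quarter), not a running counter for every row.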

Answer 3

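From what I understand of your question, you really need the V column to count quarters rather than to be a sum over each quarter. You could use something like this.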

[^\d]+

Performance is the same for tidyverse and data.table; in my case, 5 million rows ran in 12 seconds.
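
To get a feel for timings on your own machine, a rough base-R sketch along the following lines could be used. Everything here is an assumption made for illustration: the synthetic data, the sizes, and the column names `q`, `pair`, and `V`. It does not reproduce the 12-second figure quoted above. It relies on the rows already being sorted by pair and then by quarter.

```r
# Synthetic benchmark sketch; sizes and column names are assumptions.
set.seed(42)
n_pairs  <- 2000L   # hypothetical number of store/product pairs
quarters <- 0:39    # 40 quarters, stored as integer indices

# expand.grid varies the first factor fastest, so rows are sorted
# by pair and, within each pair, by quarter index
grid <- expand.grid(q = quarters, pair = seq_len(n_pairs))
# Drop roughly 10% of the rows at random to create gaps
grid <- grid[runif(nrow(grid)) > 0.1, ]

timing <- system.time({
  # A new run starts when the pair changes or a quarter is skipped
  new_run <- c(TRUE, diff(grid$pair) != 0 | diff(grid$q) != 1)
  run_id  <- cumsum(new_run)
  # sequence() turns run lengths into 1..n counters, with no per-group calls
  grid$V  <- sequence(rle(run_id)$lengths)
})
print(timing)
```

Because the whole computation is vectorized over the full table, increasing `n_pairs` should give a rough idea of how it scales toward millions of rows.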

Consecutive/uninterrupted events in time series
