Question

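Suppose I have a data set with the following structure:

I have N products
I operate in N countries
I have N payment partners
The May data set contains N days
I have N different prices that customers can choose from

For example: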

customer.id <- c(1,2,3,4,5,6,7,8)
product <- c("product1","product2","product1","product2","product1","product2","product1","product2")
country <- c("country1","country2","country1","country2","country1","country2","country1","country2")
payment.partner <- c("pp1","pp2","pp1","pp2","pp1","pp2","pp1","pp2")
day <- c("day1","day2","day1","day2","day1","day2","day1","day2")
price <- c("price1","price2","price1","price2","price1","price2","price1","price2")

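# stringsAsFactors = TRUE keeps the columns as factors, which droplevels() below requires
# (since R 4.0 the default is FALSE)
library(data.table)
customer.data <- data.frame(customer.id,product,country,payment.partner,day,price, stringsAsFactors = TRUE)
customer.data <- data.table(customer.data)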

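Suppose I want to compute aggregates from this, for example to run a forecasting algorithm for every combination. To do that, I determine the unique values of each criterion and iterate as follows: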

unique.products <- droplevels(unique(customer.data[,product]))
unique.countries <- droplevels(unique(customer.data[,country]))
unique.payment.partners <- droplevels(unique(customer.data[,payment.partner]))
unique.days <- droplevels(unique(customer.data[,day]))
unique.prices <- droplevels(unique(customer.data[,price]))

for(i in seq_along(unique.products)){
  temp.data1 <- customer.data[product==unique.products[[i]]]
  for(j in seq_along(unique.countries)){
    temp.data2 <- temp.data1[country==unique.countries[[j]]]
    for(k in seq_along(unique.payment.partners)){
      temp.data3 <- temp.data2[payment.partner==unique.payment.partners[[k]]]
      for(l in seq_along(unique.days)){
        temp.data4 <- temp.data3[day==unique.days[[l]]]
        for(m in seq_along(unique.prices)){
          temp.data5 <- temp.data4[price==unique.prices[[m]]]
          if(nrow(temp.data5)!=0){
            # do your calculations here
            print(temp.data5)
          }
        }
      }
    }
  }
}

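In general, this code structure works fine, but it becomes really painful when applied to real data with 5 million rows. I suspect R is not the best language in terms of speed and performance. I have of course used multicore processing in the past, or tried to pull such aggregates straight out of Hive or a MySQL data warehouse. Using another language such as C++ or Python is always an option too.

Sometimes, however, none of these options is available, and I end up with exactly the same processing structure again. So I would like to know whether, from an architectural point of view, there is a better and above all faster solution, since it is well known (and becomes very clear when benchmarking) that for loops and repeated subsetting of the data are very, very slow.

Thanks for all comments, hints, and possible solutions!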

Answer 1

You should read the documentation of the packages you are using. The data.table package comes with some excellent introductory tutorials.

customer.data <- data.frame(customer.id,product,country,payment.partner,day,price)
library(data.table)
setDT(customer.data)
customer.data[, 
              print(customer.data[.I]), #don't do this, just refer to the columns you want to work on
              by = .(product, country, payment.partner, day, price)]

Of course, you normally would not print the data.table subset here, but would work directly on the specific columns you need.
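As a minimal sketch of working on columns directly, the example below assumes a numeric column `amount`, which is hypothetical and not part of the original data. It computes a per-combination summary in one grouped call instead of five nested loops:

```r
library(data.table)

# Rebuild the example data from the question
customer.data <- data.table(
  customer.id     = 1:8,
  product         = rep(c("product1", "product2"), 4),
  country         = rep(c("country1", "country2"), 4),
  payment.partner = rep(c("pp1", "pp2"), 4),
  day             = rep(c("day1", "day2"), 4),
  price           = rep(c("price1", "price2"), 4),
  amount          = c(10, 20, 30, 40, 50, 60, 70, 80)  # hypothetical column, for illustration only
)

# The j expression is evaluated once for every combination that occurs in the data
result <- customer.data[,
  .(n.customers = .N, mean.amount = mean(amount)),
  by = .(product, country, payment.partner, day, price)]

print(result)
```

Only combinations that actually occur in the data are visited, so the `nrow(temp.data5) != 0` check from the loop version is not needed.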

Answer 2

Going by your description (but not by your code, whose purpose I could not make out), I think you may want to use the `interaction` function:

customer.data$grp <- droplevels(with(customer.data,
              interaction(product, country, payment.partner, day, price)))
table(customer.data$grp)
#-----------------------
product1.country1.pp1.day1.price1 
                                4 
product2.country2.pp2.day2.price2 
                                4

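You could then use lapply(split(customer.data, customer.data$grp), analytic_function) to run a separate analysis on each subset. I did not have data.table loaded, so I showed the data frame approach, but there is no reason interaction should not work in the data.table world as well: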

customer.data[ , grp2 := droplevels(interaction( 
                                      product, country ,payment.partner, day, price) ) ]
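The split-and-apply step mentioned above might look like the following sketch. `analytic_function` is a hypothetical placeholder here (it only counts rows); the real forecasting routine would take its place.

```r
# Example data from the question, as a plain data frame
customer.data <- data.frame(
  customer.id     = 1:8,
  product         = rep(c("product1", "product2"), 4),
  country         = rep(c("country1", "country2"), 4),
  payment.partner = rep(c("pp1", "pp2"), 4),
  day             = rep(c("day1", "day2"), 4),
  price           = rep(c("price1", "price2"), 4)
)

# One factor level per combination that actually occurs in the data
customer.data$grp <- droplevels(with(customer.data,
  interaction(product, country, payment.partner, day, price)))

# Hypothetical stand-in for the per-group analysis
analytic_function <- function(d) nrow(d)

# split() yields one data frame per group; lapply() runs the analysis on each
results <- lapply(split(customer.data, customer.data$grp), analytic_function)
str(results)
```

The result is a named list with one element per group, which keeps each analysis result tied to the combination it came from.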

How can I improve performance when iterating over multiple conditions in R?