How can I improve the performance of my data-cleaning code, which currently uses ddply, by using data.table?

Asked: 2013-01-22 06:47:56

Tags: performance r data.table plyr

I am trying to clean data using ddply, but it is running very slowly on 1.3M rows.

Sample code:

#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T), 
                Consumption=sample(-20:20, num_rows, replace=T), 
                StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative? 
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
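
A small aside, not from the original post: if you want the sampled data frame to be reproducible, you could fix the random seed before the sample() calls above, for example:

#Optional: make the sampled data reproducible (my addition)
set.seed(1)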

I wrote a function that removes rows where one row's Consumption value is matched by an equal but opposite (negative) Consumption value in another row with the same id.

#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one. 
clean_negatives <- function(x3){
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) <0, -1, 1) 
  x3 <- x3[0:copies,]
  x3$Consumption <- sgn*x3$Consumption
  x3$Neg <- NULL
  x3}
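
To make the intended behaviour concrete, here is a toy illustration (my own example, not from the original post) of what clean_negatives does to a single (id, StartDate, EndDate, Consumption) group once Consumption has been made absolute:

#Toy group: two +5 readings and one cancelling -5 reading for the same id and dates
toy <- data.frame(id = 1,
                  StartDate = as.Date("2011-01-26"),
                  EndDate = as.Date("2011-01-26") + 90,
                  Consumption = 5,
                  Neg = c(1, 1, -1))
clean_negatives(toy)
#abs(sum(toy$Neg)) is 1, so a single row survives, with Consumption restored to +5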

I then apply this function with ddply to remove these erroneous rows from the data:

ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
                    function(x){clean_negatives(x)})
proc.time() - ptm

I was hoping I could use data.table to speed this up, but I cannot work out how to apply data.table here.

With 1.3M rows, it has been computing on my desktop all day and still has not finished.

1 Answer:

Answer 0 (score: 6)

Your question asks for a data.table implementation, so I have shown one here. Your function can also be simplified considerably: first obtain the sign (by summing Neg), then filter the table, and then multiply Consumption by that sign, as shown below.

    require(data.table)
    # get the data.table in dt
    dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
    # first obtain the sign directly
    dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
    # then filter by abs(sum(Neg))
    dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
    # modifying for final output (the next line was commented out after Statquant's comment)
    # dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
    dt.fil[, Consumption := (Consumption*sign)]
    dt.fil <- subset(dt.fil, select=-c(Neg, sign))
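
A clarifying aside of my own, not part of the original answer: within each group, sign(sum(Neg)) gives the net sign and abs(sum(Neg)) gives how many rows should survive, and .SD[seq_len(abs(sum(Neg)))] keeps exactly that many rows. A quick base-R illustration:

    Neg <- c(1, 1, -1)        # two positive rows and one cancelling negative row
    sign(sum(Neg))            # 1 -> the surviving row keeps a positive Consumption
    seq_len(abs(sum(Neg)))    # 1 -> keep only the first row of the group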
    
Benchmarking

  • Data with a million rows:

    #Create Sample Data Frame
    num_rows <- 1e6
    df <- data.frame(id=sample(1:20, num_rows, replace=T), 
                     Consumption=sample(-20:20, num_rows, replace=T), 
                     StartDate=as.Date(sample(15000:15020, num_rows, replace=T), origin = "1970-01-01"))
    df$EndDate <- df$StartDate + 90
    df$Neg <- ifelse(df$Consumption < 0, -1, 1)
    df$Consumption <- abs(df$Consumption)

  • data.table function:

    FUN.DT <- function() {
      require(data.table)
      dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
      dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
      dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
      dt.fil[, Consumption := (Consumption*sign)]
      dt.fil <- subset(dt.fil, select = -c(Neg, sign))
    }

  • Your ddply function:

    FUN.PLYR <- function() {
      require(plyr)
      clean_negatives <- function(x3) {
        copies <- abs(sum(x3$Neg))
        sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
        x3 <- x3[0:copies, ]
        x3$Consumption <- sgn * x3$Consumption
        x3$Neg <- NULL
        x3
      }
      df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                          function(x) clean_negatives(x))
    }

  • Benchmarking with rbenchmark (1 replication only):

    require(rbenchmark)
    benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")

            test replications elapsed relative user.self sys.self user.child sys.child
    1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
    2 FUN.PLYR()            1 242.268   39.477   152.855   82.881          0         0

My data.table implementation is about 39 times faster than your current plyr implementation (I compare it against your implementation because the functionality is different).
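
As a sanity check (my addition, not part of the original answer), you could verify that the two approaches keep the same rows by selecting the shared columns, sorting, and comparing:

    # Compare the cleaned results of FUN.DT() and FUN.PLYR(), ignoring row order
    cols <- c("id", "StartDate", "EndDate", "Consumption")
    a <- as.data.frame(FUN.DT())[, cols]
    b <- FUN.PLYR()[, cols]
    a <- a[do.call(order, a), ]
    b <- b[do.call(order, b), ]
    all.equal(a, b, check.attributes = FALSE)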

Note: I load the packages inside the functions in order to measure the complete time taken to obtain the results. For the same reason, the conversion of the data.frame to a keyed data.table is also done inside the benchmarked function. This is therefore the minimum speed-up.
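
One further note of my own, not from the original answer: if you wanted to time only the grouped computation itself, the conversion to a keyed data.table could be done once outside the timed function, along these lines (FUN.DT.only is a hypothetical name; data.table and rbenchmark are assumed to be loaded as above):

    # Build and key the data.table once, outside the timing
    dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
    FUN.DT.only <- function() {
      dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
      dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
      dt.fil[, Consumption := (Consumption * sign)]
      subset(dt.fil, select = -c(Neg, sign))
    }
    benchmark(FUN.DT.only(), replications = 1)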