How can I conditionally mutate multiple columns more efficiently?

Time: 2019-07-17 14:34:20

Tags: r performance function parallel-processing tidyverse

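I recently posted a similar question. The solution kindly provided by @akun produces the desired output, but when I apply it to my real data I run into long computation times, which is a serious problem for a dataset of 100000 * 500 data points.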

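I would like to know whether there is an alternative approach for large data. Below I show my own attempt at the problem. It is based on parallel processing, but so far it has not worked. I am still trying, and any help would be much appreciated.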

My data

df<-as.data.frame(structure(list(low_account = c(1, 1, 0.5, 0.5, 0.5, 0.5), high_account = c(16, 
16, 56, 56, 56, 56), mid_account_0 = c(8.5, 8.5, 28.25, 28.25, 
28.25, 28.25), mean_account_0 = c(31.174, 30.1922101449275, 30.1922101449275, 
33.3055555555556, 31.174, 33.3055555555556), median_account_0 = c(2.1, 
3.8, 24.2, 24.2, 24.2, 24.2), low_account.1 = c(1, 1, 0.5, 0.5, 0.5, 
0.5), high_account.1 = c(16, 16, 56, 56, 56, 56), row.names = c("A001", "A002", "A003", "A004", "A005", "A006"))))

df
  low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1 high_account.1 row.names
1         1.0           16          8.50       31.17400              2.1           1.0             16      A001
2         1.0           16          8.50       30.19221              3.8           1.0             16      A002
3         0.5           56         28.25       30.19221             24.2           0.5             56      A003
4         0.5           56         28.25       33.30556             24.2           0.5             56      A004
5         0.5           56         28.25       31.17400             24.2           0.5             56      A005
6         0.5           56         28.25       33.30556             24.2           0.5             56      A006

My attempt

library(tidyverse)
df %>% 
   parallel::mcmapply(as.matrix(mutate_at(vars(matches("(mean|median|midrange)account")), ~ replace(., .<= low_account | .>= high_account, NA))), df)

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'FUN' of mode 'function' was not found

Expected output

df
  low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1 high_account.1 row.names
1         1.0           16          8.50             NA              2.1           1.0             16      A001
2         1.0           16          8.50             NA              3.8           1.0             16      A002
3         0.5           56         28.25       30.19221             24.2           0.5             56      A003
4         0.5           56         28.25       33.30556             24.2           0.5             56      A004
5         0.5           56         28.25       31.17400             24.2           0.5             56      A005
6         0.5           56         28.25       33.30556             24.2           0.5             56      A006

2 answers:

Answer 0: (score: 3)

You can try a base R solution. First, pull out the names of the columns the condition should apply to:

df_matches <-stringr::str_detect(names(df),'(mid|mean|median)_account')
df_matches <- names(df)[df_matches]
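If you would rather avoid the stringr dependency, the same column selection can be sketched in base R with `grep(..., value = TRUE)`. The vector `col_names` below is simply a copy of the column names from the question, and `target_cols` is a name introduced here for illustration:

```r
# Column names copied from the question's data frame
col_names <- c("low_account", "high_account", "mid_account_0",
               "mean_account_0", "median_account_0",
               "low_account.1", "high_account.1", "row.names")

# value = TRUE returns the matching names instead of their positions
target_cols <- grep("(mid|mean|median)_account", col_names, value = TRUE)
print(target_cols)
```

On a real data frame you would pass `names(df)` in place of `col_names`.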

Then find the cells that meet the condition and replace them with NA:

df[df_matches][df[df_matches] <= df$low_account | df[df_matches] >= df$high_account] <- NA

#   low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1
# 1         1.0           16          8.50             NA              2.1           1.0
# 2         1.0           16          8.50             NA              3.8           1.0
# 3         0.5           56         28.25       30.19221             24.2           0.5
# 4         0.5           56         28.25       33.30556             24.2           0.5
# 5         0.5           56         28.25       31.17400             24.2           0.5
# 6         0.5           56         28.25       33.30556             24.2           0.5
#   high_account.1 row.names
# 1             16      A001
# 2             16      A002
# 3             56      A003
# 4             56      A004
# 5             56      A005
# 6             56      A006
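The one-liner works because comparing a data frame with a vector of length `nrow(df)` recycles the vector down each column, so every cell is compared with the bounds from its own row. A minimal sketch on toy data (the data frame `toy` and its values are made up for illustration and are not from the question):

```r
# Hypothetical toy data: per-row bounds plus two value columns
toy <- data.frame(low = c(1, 5), high = c(10, 6),
                  a = c(0, 5.5), b = c(3, 7))
cols <- c("a", "b")

# Logical matrix with one cell per (row, column);
# TRUE where the value is at or outside that row's bounds
mask <- toy[cols] <= toy$low | toy[cols] >= toy$high
print(mask)

# Assign NA to exactly the flagged cells
toy[cols][mask] <- NA
print(toy)
```

Here `a` in row 1 (0 <= 1) and `b` in row 2 (7 >= 6) are flagged, and the other two cells are left unchanged.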

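On the data provided, this is roughly 6 to 7 times faster than the given solution (comparing the median and minimum timings below):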

library(microbenchmark)

microbenchmark(
  {
    df %>% 
      mutate_at(vars(matches("(mid|mean|median)_account")),
                ~ replace(., .<= low_account | .>= high_account, NA))
  },
  {
    df[df_matches][df[df_matches] <= df$low_account | df[df_matches] >= df$high_account] <- NA
  }
)


 # min       lq       mean     median    uq        max      neval
 # 2183.264 2295.653 2750.3255 2420.034 3003.7330 6188.024   100
 # 310.392  340.145  453.5984  410.258  449.3935  2005.300   100

Answer 1: (score: 1)

If the OP doesn't mind using the data.table package, some of the following approaches should be faster when processing 50 million rows:

data.table
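As a rough illustration of what an in-place data.table version could look like (a sketch only, and not necessarily the approach this answer had in mind), the loop below uses `set()` to overwrite out-of-range cells by reference, one column at a time. The toy table `dt` mirrors the question's column layout with shortened, illustrative values, and `out_of_range` is a name introduced here:

```r
library(data.table)

# Toy data with the same column layout as the question (values are illustrative)
dt <- data.table(
  low_account      = c(1, 1, 0.5),
  high_account     = c(16, 16, 56),
  mid_account_0    = c(8.5, 8.5, 28.25),
  mean_account_0   = c(31.174, 30.192, 30.192),
  median_account_0 = c(2.1, 3.8, 24.2)
)

cols <- grep("(mid|mean|median)_account", names(dt), value = TRUE)

for (col in cols) {
  # Row indices where this column is at or outside the per-row bounds
  out_of_range <- which(dt[[col]] <= dt$low_account | dt[[col]] >= dt$high_account)
  # set() modifies dt by reference, without copying the whole table
  set(dt, i = out_of_range, j = col, value = NA_real_)
}

print(dt)
```

Whether this beats the base R one-liner on a 100000 * 500 table would need to be benchmarked on the real data.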