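I recently posted a similar question. The solution kindly provided by @akun produced the desired output, but when I applied it to my real data I ran into problems with computation time, which is considerable for data containing 100000 * 500 data points.
I would like to know whether there are alternative approaches for large data. Below I show my attempt at solving the problem. It is based on parallel processing but has not succeeded so far. I am still trying, but any help would be greatly appreciated.
My data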
df<-as.data.frame(structure(list(low_account = c(1, 1, 0.5, 0.5, 0.5, 0.5), high_account = c(16,
16, 56, 56, 56, 56), mid_account_0 = c(8.5, 8.5, 28.25, 28.25,
28.25, 28.25), mean_account_0 = c(31.174, 30.1922101449275, 30.1922101449275,
33.3055555555556, 31.174, 33.3055555555556), median_account_0 = c(2.1,
3.8, 24.2, 24.2, 24.2, 24.2), low_account.1 = c(1, 1, 0.5, 0.5, 0.5,
0.5), high_account.1 = c(16, 16, 56, 56, 56, 56), row.names = c("A001", "A002", "A003", "A004", "A005", "A006"))))
df
low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1 high_account.1 row.names
1 1.0 16 8.50 31.17400 2.1 1.0 16 A001
2 1.0 16 8.50 30.19221 3.8 1.0 16 A002
3 0.5 56 28.25 30.19221 24.2 0.5 56 A003
4 0.5 56 28.25 33.30556 24.2 0.5 56 A004
5 0.5 56 28.25 31.17400 24.2 0.5 56 A005
6 0.5 56 28.25 33.30556 24.2 0.5 56 A006
My attempt
library(tidyverse)
df %>%
parallel::mcmapply(as.matrix(mutate_at(vars(matches("(mean|median|midrange)account")), ~ replace(., .<= low_account | .>= high_account, NA))), df)
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'FUN' of mode 'function' was not found
Expected output
df
low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1 high_account.1 row.names
1 1.0 16 8.50 NA 2.1 1.0 16 A001
2 1.0 16 8.50 NA 3.8 1.0 16 A002
3 0.5 56 28.25 30.19221 24.2 0.5 56 A003
4 0.5 56 28.25 33.30556 24.2 0.5 56 A004
5 0.5 56 28.25 31.17400 24.2 0.5 56 A005
6 0.5 56 28.25 33.30556 24.2 0.5 56 A006
Answer 0 (score: 3)
You could try a base R solution by first pulling out the columns we want to apply the condition to:
df_matches <-stringr::str_detect(names(df),'(mid|mean|median)_account')
df_matches <- names(df)[df_matches]
Then find the subset of values that meets our condition and replace those values with NAs:
df[df_matches][df[df_matches] <= df$low_account | df[df_matches] >= df$high_account] <- NA
# low_account high_account mid_account_0 mean_account_0 median_account_0 low_account.1
# 1 1.0 16 8.50 NA 2.1 1.0
# 2 1.0 16 8.50 NA 3.8 1.0
# 3 0.5 56 28.25 30.19221 24.2 0.5
# 4 0.5 56 28.25 33.30556 24.2 0.5
# 5 0.5 56 28.25 31.17400 24.2 0.5
# 6 0.5 56 28.25 33.30556 24.2 0.5
# high_account.1 row.names
# 1 16 A001
# 2 16 A002
# 3 56 A003
# 4 56 A004
# 5 56 A005
# 6 56 A006
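The comparison above works because R recycles `df$low_account` and `df$high_account` down each selected column, so every value is compared with the bounds from its own row. The same idea can be sketched on a plain numeric matrix, which avoids subsetting the data frame several times. This is only an illustration: the three-row data frame is a stand-in copied from the first rows of the question's data, the names `cols`, `m` and `out_of_range` are chosen here, and whether it is faster on 100000 * 500 data has not been measured.

```r
# Stand-in data: the first three rows of the question's example
df <- data.frame(
  low_account      = c(1, 1, 0.5),
  high_account     = c(16, 16, 56),
  mid_account_0    = c(8.5, 8.5, 28.25),
  mean_account_0   = c(31.174, 30.1922101449275, 30.1922101449275),
  median_account_0 = c(2.1, 3.8, 24.2)
)

# Columns the condition applies to (base R equivalent of the str_detect step)
cols <- grep("(mid|mean|median)_account", names(df), value = TRUE)

# Work on a numeric matrix; the length-nrow vectors are recycled column by
# column, so element [i, j] is compared with the bounds of row i
m <- as.matrix(df[cols])
out_of_range <- m <= df$low_account | m >= df$high_account
m[out_of_range] <- NA

# Write the cleaned columns back into the data frame
df[cols] <- m
```

Here only `mean_account_0` in the first two rows falls outside its row's bounds, so those are the only values set to NA.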
On the data provided, this is roughly 7 times faster than the given solution:
library(microbenchmark)
microbenchmark(
{
df %>%
mutate_at(vars(matches("(mid|mean|median)_account")),
~ replace(., .<= low_account | .>= high_account, NA))
},
{
df[df_matches][df[df_matches] <= df$low_account | df[df_matches] >= df$high_account] <- NA
}
)
# min lq mean median uq max neval
# 2183.264 2295.653 2750.3255 2420.034 3003.7330 6188.024 100
# 310.392 340.145 453.5984 410.258 449.3935 2005.300 100
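These timings come from the six-row example, so they may not carry over to data of the size mentioned in the question. A rough way to check is to time the same assignment on larger synthetic data. The sketch below assumes made-up sizes, column names and value ranges (`n_rows`, `n_cols`, `mean_account_1`, ...), none of which come from the original post, and reports no timings because they depend on the machine.

```r
set.seed(1)
n_rows <- 1000  # assumed size; raise towards 100000 to mimic the real data
n_cols <- 50    # assumed number of value columns; the question mentions ~500

# Synthetic bounds and values (ranges are arbitrary choices for illustration)
big <- data.frame(
  low_account  = runif(n_rows, 0, 1),
  high_account = runif(n_rows, 10, 60)
)
vals <- matrix(
  runif(n_rows * n_cols, 0, 70), n_rows, n_cols,
  dimnames = list(NULL, paste0("mean_account_", seq_len(n_cols)))
)
big <- cbind(big, vals)

cols <- grep("(mid|mean|median)_account", names(big), value = TRUE)

# Time the base R replacement from this answer on the larger data
timing <- system.time({
  big[cols][big[cols] <= big$low_account | big[cols] >= big$high_account] <- NA
})
print(timing)
```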
Answer 1 (score: 1)
If the OP does not mind using the data.table package, some of the following approaches should be faster when handling 50 million rows:
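As a hedged illustration of what a data.table version might look like (not necessarily one of this answer's own approaches), the sketch below loops over the matching columns and uses `data.table::set()` to overwrite the out-of-range values by reference. The three-row data frame is a stand-in copied from the question's example, and the names `dt`, `cols` and `bad` are chosen here.

```r
library(data.table)

# Stand-in data: the first three rows of the question's example
df <- data.frame(
  low_account      = c(1, 1, 0.5),
  high_account     = c(16, 16, 56),
  mid_account_0    = c(8.5, 8.5, 28.25),
  mean_account_0   = c(31.174, 30.1922101449275, 30.1922101449275),
  median_account_0 = c(2.1, 3.8, 24.2)
)

dt <- as.data.table(df)
cols <- grep("(mid|mean|median)_account", names(dt), value = TRUE)

for (col in cols) {
  # Row indices whose value lies on or outside that row's bounds
  bad <- which(dt[[col]] <= dt$low_account | dt[[col]] >= dt$high_account)
  # set() modifies dt in place, so the table is not copied per column
  set(dt, i = bad, j = col, value = NA_real_)
}
```

Updating by reference is the main reason this pattern tends to scale well, since no intermediate copies of the full table are made; actual timings on 100000 * 500 data would still need to be measured.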