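I am asking for help designing a very efficient solution that can get through a table of 14 million rows quickly.
Basically, the problem is to find, for each ID, the subgroups with Value == 0 and, starting from each of them, to count the number of consecutive subgroups that also have Value == 0 (within that ID).
This new information needs to be saved in an external table made up of "ID", "subgroup" and "count".
To be as clear as possible, I will give an example. Suppose we have the following data: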
ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
subgroup <- c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p", "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p")
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0 , 0, 1750, 0)
df <- data.frame(ID, subgroup, Value)
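For each ID, we need to find every subgroup with Value == 0 that starts a run of zeros, and then count how many consecutive rows with Value == 0 follow it within the same ID.
Therefore, the solution must be as follows: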
ID <- c(1, 2, 3, 3)
subgroup <- c("1a2p", "2a1p", "3a1p", "3a5p")
count <- c(1, 3, 2, 0)
solution_df <- data.frame(ID, subgroup, count)
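To make the meaning of count concrete, here is a minimal base R sketch that looks only at the five Value entries of ID 3 from the example. It assumes that count is the length of each run of zeros minus one (the starting subgroup itself is not counted), which is how the expected table above reads.

```r
# Value entries for ID 3 (subgroups 3a1p to 3a5p) in the example data
value_id3 <- c(0, 0, 0, 1750, 0)

# Run-length encode the zero / non-zero flags
runs <- rle(value_id3 == 0)

# Keep only the runs of zeros; subtract 1 so the starting row is not counted
count_id3 <- runs$lengths[runs$values] - 1
print(count_id3)  # 2 0
```

The two results correspond to the rows for 3a1p and 3a5p in the expected table.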
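Note that subgroup == "3a5p" appears with count == 0: that subgroup has Value == 0, but it is not followed by any consecutive subgroup with Value == 0.
I really hope I have been as clear as possible.
Performance measured with system.time() on the previous version of the problem, timing only the function calls (times in seconds):
data.table approach
user: 881.21 system: 109.73 elapsed: 993.43
dplyr approach
user: 91.66 system: 0.56 elapsed: 93.05
base R approach
user: 1.67 system: 0.29 elapsed: 2.06
Performance on the updated task:
dplyr approach
user: 75.28 system: 1.00 elapsed: 77.16
base R approach
user: 104.75 system: 0.61 elapsed: 105.74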
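Answer 0 (score: 1)
Here is an option using dplyr (written for the original version of the question, in which ID held the labels that are now in subgroup):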
library(dplyr)
df %>%
  mutate(grp = c(TRUE, diff(Value == 0) > 0)) %>%
  filter(Value == 0) %>%
  group_by(grp = cumsum(grp)) %>%
  summarise(ID = first(ID), count = n() - 1) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 4 x 2
# ID count
# <fctr> <dbl>
#1 1a2p 0
#2 2a2p 2
#3 3a1p 2
#4 3a5p 0
Or using rle from base R:
data.frame(ID = with(df, ID[c(FALSE, diff(Value==0) > 0)]),
count = with(rle(df$Value==0), lengths[values]-1))
# ID count
#1 1a2p 0
#2 2a2p 2
#3 3a1p 2
#4 3a5p 0
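Both snippets above locate the start of each run of zeros with diff(Value == 0) > 0. The sketch below, on a made-up five-element vector, shows what that expression produces. It uses is_zero[1] as the first element rather than a fixed TRUE or FALSE; that is a variation added here for illustration (so that a vector beginning with a zero is also covered), not part of the answer's code.

```r
value <- c(2000, 0, 0, 2000, 0)   # illustrative values, not the question's data
is_zero <- value == 0

# diff() on a logical vector gives +1 where FALSE turns into TRUE
step <- diff(is_zero)

# A run of zeros starts at row 1 if that row is zero, or wherever step is +1
run_start <- c(is_zero[1], step > 0)
print(which(run_start))  # 2 5
```

With a fixed FALSE in first position, a run of zeros starting at row 1 would not be flagged, so that form relies on the first Value being non-zero.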
With the updated question, we can also group by ID:
df %>%
  mutate(grp = c(TRUE, diff(Value == 0) > 0)) %>%
  filter(Value == 0) %>%
  group_by(ID, grp = cumsum(grp)) %>%
  summarise(subgroup = first(subgroup), count = n() - 1) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 4 x 3
# ID subgroup count
# <dbl> <fctr> <dbl>
#1 1 1a2p 1
#2 2 2a1p 3
#3 3 3a1p 2
#4 3 3a5p 0
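Grouping by ID matters here because, in the example data, the zeros at the end of ID 1 are immediately followed by the zeros at the start of ID 2. A small base R sketch of the difference, using the question's vectors (zero_runs is a helper name introduced only for this illustration):

```r
ID    <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)

# Lengths of the runs of zeros in a vector
zero_runs <- function(x) with(rle(x == 0), lengths[values])

# Ignoring ID: the first run spans the boundary between ID 1 and ID 2
whole <- zero_runs(Value)
print(whole)  # 6 3 1

# Per ID: that run is split into 2 rows (ID 1) and 4 rows (ID 2)
per_id <- lapply(split(Value, ID), zero_runs)
print(per_id)
```

Subtracting 1 from the per-ID run lengths gives the count column shown above.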
Or with base R:
res <- setNames(stack(with(df, tapply(Value == 0, ID, FUN =
function(x) with(rle(x), lengths[values]-1))))[2:1], c("ID", "count"))
i1 <- with(rle(df$Value == 0), rep(seq_along(values)*values, lengths))
res$subgroup <- df$subgroup[!duplicated(cbind(df['ID'], i1)) & i1 > 0]
res
# ID count subgroup
#1 1 1 1a2p
#2 2 3 2a1p
#3 3 2 3a1p
#4 3 0 3a5p
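For a table with millions of rows, it may also be worth trying a version that avoids calling a function once per ID. The following is a sketch in base R only, not a benchmarked solution: count_zero_runs is a hypothetical helper name, and it assumes the rows are already ordered by ID, that Value contains no NA, and that the table has at least one row.

```r
# Hypothetical helper: one pass over the columns, no per-ID function calls
count_zero_runs <- function(df) {
  n <- nrow(df)
  stopifnot(n > 0)                      # assumption: non-empty table
  is_zero <- df$Value == 0              # assumption: no NA in Value

  # A new run starts at row 1, or where the ID or the zero flag changes
  new_run <- c(TRUE, df$ID[-1] != df$ID[-n] | is_zero[-1] != is_zero[-n])
  run_len <- tabulate(cumsum(new_run))  # number of rows in each run
  start   <- which(new_run)             # first row of each run
  keep    <- is_zero[start]             # keep only the runs of zeros

  data.frame(ID       = df$ID[start][keep],
             subgroup = df$subgroup[start][keep],
             count    = run_len[keep] - 1,
             stringsAsFactors = FALSE)
}

# The example data from the question
ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
subgroup <- c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p",
              "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p")
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)
df <- data.frame(ID, subgroup, Value)

res <- count_zero_runs(df)
print(res)
```

Because the ID comparison is part of the run boundary test, a run of zeros that continues across two IDs is split at the boundary, in the same way as the grouped versions above.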