Find the subgroups associated with a value and count them for each ID

Time: 2017-11-29 08:50:36

Tags: r performance data.table

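I am asking for your help in designing a very efficient solution that runs quickly over a table of 14kk rows.

Basically, the problem is to find, for each ID, the subgroup with Value == 0 and, starting from it, to count the number of consecutive subgroups with Value == 0 (within each ID).

This new information needs to be saved in a separate table made up of "ID", "subgroup" and "count".

To be as clear as possible, I will give an example. Suppose we have the following data: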

ID <- c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
subgroup <- c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p", "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p")
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0 , 0, 1750, 0)

df <- data.frame(ID, subgroup, Value)

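For each ID, we need to find the subgroup at which Value == 0 begins, and then count the number of consecutive rows with Value == 0 that follow it. Therefore, the solution must be as follows: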

ID <- c(1, 2, 3, 3)    
subgroup <- c("1a2p", "2a2p", "3a1p", "3a5p")
count <- c(1, 3, 2, 0)
solution_df <- data.frame(ID, subgroup, count)

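Note that subgroup == "3a5p" appears together with count == 0, which means that this subgroup has Value == 0 but is not followed by any consecutive subgroup with Value == 0.

I really hope I have been as clear as possible.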

Timings on the previous version of the problem, measured with system.time() and considering only the function:

data.table approach

user: 881.21 system: 109.73 elapsed: 993.43

dplyr approach

user: 91.66  system: 0.56  elapsed: 93.05

base R approach

user: 1.67 system: 0.29  elapsed: 2.06

dplyr approach

user: 75.28  system: 1.00  elapsed: 77.16

Update

Performance on the updated task:

base R approach

user: 104.75 system: 0.61  elapsed: 105.74


1 answer:

Answer 0 (score: 1)

Here is an option using dplyr

library(dplyr)
df %>%
    mutate(grp = c(TRUE, diff(Value==0)>0)) %>% 
    filter(Value ==0) %>%
    group_by(grp = cumsum(grp)) %>%
    summarise(ID = first(ID), count = n()-1) %>%
    ungroup() %>% 
    select(-grp) 
# A tibble: 4 x 2
#    ID count
#  <fctr> <dbl>
#1   1a2p     0
#2   2a2p     2
#3   3a1p     2
#4   3a5p     0

Or using rle from base R

data.frame(ID = with(df, ID[c(FALSE, diff(Value==0) > 0)]),
                 count = with(rle(df$Value==0), lengths[values]-1))
#     ID count
#1 1a2p     0
#2 2a2p     2
#3 3a1p     2
#4 3a5p     0
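
For readers unfamiliar with rle, here is a small illustration of what it returns. It is only a sketch: it uses the Value vector from the updated question rather than the earlier data, and the variable names r, zero_run_lengths and counts are introduced here for illustration.

```r
# Value vector taken from the updated question
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)

# rle() compresses a vector into runs of equal values
r <- rle(Value == 0)
r$values   # FALSE  TRUE FALSE  TRUE FALSE  TRUE
r$lengths  # 1 6 2 3 1 1

# keep only the runs where Value == 0, then subtract the starting row
zero_run_lengths <- r$lengths[r$values]   # 6 3 1
counts <- zero_run_lengths - 1            # 5 2 0
```

Note that the first zero run has length 6 because it spans the end of ID 1 and the start of ID 2. When the ID column is ignored, that run is counted as one, which is why the updated version has to take ID into account.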

Update

With the updated question, we can also group by ID

df %>% 
    mutate(grp = c(TRUE, diff(Value==0)>0)) %>%
    filter(Value == 0) %>%
    group_by(ID, grp = cumsum(grp)) %>%
    summarise(subgroup = first(subgroup), count = n()-1) %>% 
    ungroup() %>% 
    select(-grp)
# A tibble: 4 x 3
#    ID subgroup count
#  <dbl>   <fctr> <dbl>
#1     1     1a2p     1
#2     2     2a1p     3
#3     3     3a1p     2
#4     3     3a5p     0
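
The run-start flag built in the mutate step can be inspected on its own in base R. The following is a sketch on the example Value vector; the names zero, starts and run_label are illustrative and do not appear in the pipeline above.

```r
Value <- c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)
zero <- Value == 0

# diff() on a logical vector gives 1 at a FALSE -> TRUE transition,
# so `> 0` marks the first row of each zero run; the first row is always flagged
starts <- c(TRUE, diff(zero) > 0)

# after dropping the non-zero rows, cumsum() turns the flags into run labels
run_label <- cumsum(starts[zero])
run_label  # 1 1 1 1 1 1 2 2 2 3
```

Label 1 covers six rows here because that run crosses from ID 1 into ID 2. Grouping by both ID and the label splits it into a run of 2 rows for ID 1 and a run of 4 rows for ID 2, which gives the counts 1 and 3 in the output.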

base R

res <- setNames(stack(with(df, tapply(Value == 0, ID, FUN = 
   function(x) with(rle(x), lengths[values]-1))))[2:1], c("ID", "count"))
i1 <- with(rle(df$Value == 0), rep(seq_along(values)*values, lengths))

res$subgroup <- df$subgroup[!duplicated(cbind(df['ID'], i1)) & i1 > 0]
res
#   ID count subgroup
#1  1     1     1a2p
#2  2     3     2a1p
#3  3     2     3a1p
#4  3     0     3a5p
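
Since the question is about speed on a very large table, a fully vectorised base R variant that avoids a per-ID function call may also be worth trying. This is only a sketch and has not been timed on the real data. It assumes the rows are already sorted by ID and are in subgroup order within each ID, and the names zero, start, run_id and res are introduced here for illustration.

```r
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  subgroup = c("1a1p", "1a2p", "1a3p", "2a1p", "2a2p", "2a3p", "2a4p",
               "2a5p", "2a6p", "3a1p", "3a2p", "3a3p", "3a4p", "3a5p"),
  Value = c(2000, 0, 0, 0, 0, 0, 0, 2000, 1800, 0, 0, 0, 1750, 0)
)

n <- nrow(df)
zero <- df$Value == 0

# a zero run starts on a zero row that is the first row, follows a change
# of ID, or follows a non-zero row
start <- zero & c(TRUE, df$ID[-1] != df$ID[-n] | !zero[-n])

# label every row with the index of the most recent run start
run_id <- cumsum(start)

# rows per zero run, minus the starting row itself
count <- tabulate(run_id[zero], nbins = sum(start)) - 1

res <- data.frame(ID = df$ID[start], subgroup = df$subgroup[start], count = count)
res
```

On the example data this gives the same four rows as the outputs above (1a2p, 2a1p, 3a1p and 3a5p, with counts 1, 3, 2 and 0).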