Question

Here is a sample of my data:

kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L, 
28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 
9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L, 
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L, 
28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L, 
9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L, 
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989, 
0.00293255131964809, 0.00495049504950495, 0.00215982721382289, 
0.0120481927710843, 0.00561797752808989, 0.00293255131964809, 
0.00591715976331361, 0.00495049504950495), mash_score = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")

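What I want to achieve is the following:

For each group formed by the columns ID_WORKES, TABL_NOM, NAME, KOD_DOR and KOD_DEPO, I want to collapse the rows that repeat the same ID_SP_NAR value into a single row.

For example, there are six rows here that share the same ID_SP_NAR value (30). In this case I want to summarize these six rows so that the single remaining row for ID_SP_NAR == 30 holds the mean of prop_violations over those six rows.

The desired output looks like this: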
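As a quick check of the arithmetic, the mean for ID_SP_NAR == 30 can be computed directly. This is only an illustrative sketch: the six values are copied from the sample data above, and vals_30 is a hypothetical name, not part of the original data.

```r
# The six prop_violations values from the sample rows with ID_SP_NAR == 30
# (vals_30 is a hypothetical name used only for this check)
vals_30 <- c(0.00215982721382289, 0.0120481927710843, 0.00561797752808989,
             0.00293255131964809, 0.00591715976331361, 0.00495049504950495)

# Mean that the single remaining row for ID_SP_NAR == 30 should hold
mean_30 <- mean(vals_30)
print(mean_30)  # about 0.005604367
```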

prop_violations

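One more thing: if, among the rows that repeat an ID_SP_NAR value, some row has mash_score > 0, then the row that is kept should be the one with mash_score > 0 instead of the mean.

For example: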

prop_violations

In this case, only the prop_violations value 0.002932551 is kept for ID_SP_NAR = 30, because its mash_score > 0. How can I implement this condition?

Answer 1

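An option using data.table:

library(data.table)

setDT(kod)  # convert the data.frame to a data.table by reference
kod[, {
        if (any(mash_score > 0)) {
            # keep the first row of the group whose mash_score is positive
            i <- which(mash_score > 0)[1L]
            .(prop_violations = prop_violations[i], mash_score = mash_score[i])
        } else {
            # no positive mash_score: average prop_violations over the group
            .(prop_violations = mean(prop_violations), mash_score = mash_score[1L])
        }
    },
    by = .(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR)]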

Output:

   ID_WORKES TABL_NOM NAME KOD_DOR KOD_DEPO ID_SP_NAR prop_violations mash_score
1:  28029571     9716  Dim      28     9167        20     0.004500341          0
2:  28029571     9716  Dim      28     9167        30     0.002932551          1

Data:

kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L, 
28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 
9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L, 
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L, 
28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L, 
9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L, 
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989, 
0.00293255131964809, 0.00495049504950495, 0.00215982721382289, 
0.0120481927710843, 0.00561797752808989, 0.00293255131964809, 
0.00591715976331361, 0.00495049504950495), mash_score = c(0L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")

Answer 2

Here is a solution using the tidyverse (the functions below come from dplyr):

library(dplyr)

kod %>%
  group_by(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR) %>%
  summarise(prop_violations = if (all(mash_score == 0)) mean(prop_violations)
                              else last(prop_violations[mash_score > 0]))

If all mash_score values in a group are zero, the mean of prop_violations is returned (using mean). If at least one mash_score is greater than zero, the last prop_violations value with mash_score > 0 is returned (using last).
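
If mash_score should also appear in the result, as in the first answer's output, the summarise call could be extended along these lines. This is a minimal sketch, assuming dplyr 1.0.0 or later for the .groups argument; kod_small is a hypothetical reduced copy of the sample, grouped by ID_SP_NAR only because the other grouping columns are constant there.

```r
library(dplyr)

# Hypothetical reduced copy of the sample: only the columns the rule needs
kod_small <- data.frame(
  ID_SP_NAR = c(20L, 20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L),
  prop_violations = c(0.00561797752808989, 0.00293255131964809,
                      0.00495049504950495, 0.00215982721382289,
                      0.0120481927710843, 0.00561797752808989,
                      0.00293255131964809, 0.00591715976331361,
                      0.00495049504950495),
  mash_score = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)
)

out <- kod_small %>%
  group_by(ID_SP_NAR) %>%
  summarise(
    # computed first, while mash_score still refers to the original column
    prop_violations = if (all(mash_score == 0)) mean(prop_violations)
                      else last(prop_violations[mash_score > 0]),
    # 0 when no row is positive, otherwise the last positive mash_score
    mash_score = if (all(mash_score == 0)) 0L
                 else last(mash_score[mash_score > 0]),
    .groups = "drop"
  )
print(out)
```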

Remove duplicates with aggregated group in R
