Here is a sample of my data:
kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L,
28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L,
9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L,
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L,
28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L,
9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989,
0.00293255131964809, 0.00495049504950495, 0.00215982721382289,
0.0120481927710843, 0.00561797752808989, 0.00293255131964809,
0.00591715976331361, 0.00495049504950495), mash_score = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")
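What I want to achieve:
For each group formed by the columns ID_WORKES, TABL_NOM, NAME, KOD_DOR and KOD_DEPO, I want to collapse the rows that share the same ID_SP_NAR value.
For example, there are six rows here with ID_SP_NAR == 30 whose prop_violations values differ.
In this case I want to summarise these six rows so that the single remaining prop_violations value for ID_SP_NAR == 30 equals the mean of the six rows.
The desired output has one row per ID_SP_NAR value, with prop_violations holding that mean.
One more thing: if mash_score > 0 for some of the rows that repeat an ID_SP_NAR value, then the row that remains should be the last one with mash_score > 0.
For example, suppose mash_score is 1 for the row with ID_SP_NAR == 30 and prop_violations = 0.002932551.
In this case, only the value 0.002932551 is kept as prop_violations for ID_SP_NAR == 30, because mash_score > 0. How can I implement this condition?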
Answer 0 (score: 4)
An option using data.table:
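library(data.table)

setDT(kod)  # convert the data.frame to a data.table by reference
kod[, {
  if (any(mash_score > 0)) {
    # keep the first row in the group whose mash_score is positive
    i <- which(mash_score > 0)[1L]
    .(prop_violations = prop_violations[i], mash_score = mash_score[i])
  } else {
    # no positive mash_score: collapse the group to its mean
    .(prop_violations = mean(prop_violations), mash_score = mash_score[1L])
  }
},
by = .(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR)]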
Output:
   ID_WORKES TABL_NOM NAME KOD_DOR KOD_DEPO ID_SP_NAR prop_violations mash_score
1:  28029571     9716  Dim      28     9167        20     0.004500341          0
2:  28029571     9716  Dim      28     9167        30     0.002932551          1
Data:
kod <- structure(list(ID_WORKES = c(28029571L, 28029571L, 28029571L,
28029571L, 28029571L, 28029571L, 28029571L, 28029571L, 28029571L
), TABL_NOM = c(9716L, 9716L, 9716L, 9716L, 9716L, 9716L, 9716L,
9716L, 9716L), NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "Dim", class = "factor"), ID_SP_NAR = c(20L,
20L, 20L, 30L, 30L, 30L, 30L, 30L, 30L), KOD_DOR = c(28L, 28L,
28L, 28L, 28L, 28L, 28L, 28L, 28L), KOD_DEPO = c(9167L, 9167L,
9167L, 9167L, 9167L, 9167L, 9167L, 9167L, 9167L), COLUMN_MASH = c(13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L), prop_violations = c(0.00561797752808989,
0.00293255131964809, 0.00495049504950495, 0.00215982721382289,
0.0120481927710843, 0.00561797752808989, 0.00293255131964809,
0.00591715976331361, 0.00495049504950495), mash_score = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)), row.names = c(NA, -9L), class = "data.frame")
Answer 1 (score: 3)
Here is a solution using the tidyverse packages:
library(dplyr)

kod %>%
  group_by(ID_WORKES, TABL_NOM, NAME, KOD_DOR, KOD_DEPO, ID_SP_NAR) %>%
  summarise(prop_violations = if (all(mash_score == 0)) mean(prop_violations) else last(prop_violations[mash_score > 0]))
If all mash_score values in a particular group are zero, the mean of prop_violations is returned (using mean). If at least one mash_score is greater than zero, the last value of prop_violations with mash_score > 0 is returned (using last).
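For readers who want to check the rule without installing any packages, the same logic can be sketched in base R. This is only an illustration: the helper name collapse_group and the toy columns (grp, value, score) are hypothetical and do not appear in the answers.

```r
# Hypothetical helper: collapse one group's values according to the rule.
# All scores zero -> mean of the values; otherwise -> last value whose score > 0.
collapse_group <- function(value, score) {
  if (all(score == 0)) mean(value) else tail(value[score > 0], 1)
}

# Toy data (hypothetical values, not the question's data set)
toy <- data.frame(
  grp   = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30),
  score = c(0, 0, 0, 1, 0, 2)
)

# Split the row indices by group and apply the rule to each group.
res <- sapply(split(seq_len(nrow(toy)), toy$grp), function(idx) {
  collapse_group(toy$value[idx], toy$score[idx])
})

print(res)
```

The result is a named numeric vector with one entry per group, which makes it easy to compare against the output of the dplyr pipeline on the same data.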