How to count the number of items in subset groups of an array that are greater than a specific value in that array?

Time: 2015-11-02 09:00:54

Tags: r count data.table subset

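In R, I want to count the number of items in subset groups of an array that are greater than a specific value in that same array. See the example below: each year has an external benchmark, which is given as part of the data (it is not the mean of the data set). For each row that gives a year's benchmark, I want to add one column with the number of males whose weight is greater than the benchmark, and one column with the number of females whose weight is greater than the benchmark.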

> MyData
   year      type weight DesiredOutput1 DesiredOutput2
1  1990    Female     78             NA             NA
2  1990      Male     74             NA             NA
3  1990    Female     80             NA             NA
4  1990      Male     90             NA             NA
5  1990      Male     94             NA             NA
6  1990      Male     70             NA             NA
7  1990    Female     65             NA             NA
8  1990    Female     61             NA             NA
9  1990 benchmark     78              4              1
10 1990    Female     71             NA             NA
11 1990      Male     91             NA             NA
12 1990    Female     70             NA             NA
13 1990      Male     81             NA             NA
14 1991      Male     71             NA             NA
15 1991 benchmark     79              1              2
16 1991    Female     80             NA             NA
17 1991    Female     81             NA             NA
18 1991      Male     70             NA             NA
19 1991      Male     80             NA             NA
20 1991    Female     65             NA             NA
21 1992    Female     79             NA             NA
22 1992 benchmark     80              3              1
23 1992      Male     81             NA             NA
24 1992      Male     82             NA             NA
25 1992      Male     86             NA             NA
26 1992      Male     80             NA             NA
27 1992    Female     81             NA             NA

I can add a count of the number of males/females in a given year with the following code:

setDT(MyData)[, Count:=.N, by='year,type']

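But I don't know how to include the fact that I only want to count the males/females whose weight is greater than that year's benchmark. Is there a way to refer to this benchmark value? I have seen several solutions for counting the values greater than a fixed number (for example, greater than 70), but how do you compare against a value inside the array itself?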

3 answers:

Answer 0: (score: 8)

I don't think you need all those NAs. If you only need the counts, you can simply tabulate by condition. Here is an example

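library(data.table)
# Per year: keep the types whose weight exceeds that year's benchmark, then tabulate them
setDT(MyData)[, as.list(table(factor(type[weight > weight[type == 'benchmark']]))), 
                by = year]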
#    year Female Male
# 1: 1990      1    4
# 2: 1991      2    1
# 3: 1992      1    3
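
To see the key comparison above in isolation, here is a minimal base R sketch. The `toy` data frame and the names `bench`, `above`, and `counts` are made up for illustration and are not part of the original data; the sketch assumes exactly one benchmark row per year.

```r
# Hypothetical toy data, much smaller than MyData
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2000, 2001, 2001, 2001),
  type   = c("Male", "Female", "benchmark", "Male", "benchmark", "Female", "Male"),
  weight = c(82, 75, 80, 90, 70, 71, 60),
  stringsAsFactors = FALSE
)

# For every row, look up the benchmark weight of that row's year
is_bench <- toy$type == "benchmark"
bench <- toy$weight[is_bench][match(toy$year, toy$year[is_bench])]

# Keep rows strictly above their year's benchmark, then tabulate year by type
above  <- toy[toy$weight > bench, ]
counts <- table(above$year, above$type)
counts
```

The benchmark row never passes the strict `>` test against itself, so it drops out of the table without any extra filtering.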

Another option (probably a bit faster) is to select the occurrences by condition and then dcast

dcast(setDT(MyData)[, type[weight > weight[type == 'benchmark']], by = year], 
                      year ~ V1, length)
#    year Female Male
# 1: 1990      1    4
# 2: 1991      2    1
# 3: 1992      1    3

Or similarly

setDT(MyData)[, type[weight > weight[type == 'benchmark']], by = year
               ][, table(year, factor(V1))]
# year   Female Male
# 1990        1    4
# 1991        2    1
# 1992        1    3

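Either way, if you insist on feeding the results back into your original data set, a quick way is a join (though this won't produce the NAs), something like (using v1.9.6+)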

res <- dcast(setDT(MyData)[, type[weight > weight[type == 'benchmark']], by = year],
                             year ~ V1, length)
MyData[res, c("Female", "Male") := .(i.Female, i.Male), on = "year"]
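
If the NAs on the non-benchmark rows are actually wanted, as in the question's desired output, one possible alternative is to compute the counts per group and write them only on the benchmark row. This is a sketch rather than part of the answer above: `dt` is a hypothetical toy table, and it assumes exactly one benchmark row per year.

```r
library(data.table)

# Hypothetical toy data standing in for MyData
dt <- data.table(
  year   = c(2000, 2000, 2000, 2000, 2001, 2001, 2001),
  type   = c("Male", "Female", "benchmark", "Male", "benchmark", "Female", "Male"),
  weight = c(82, 75, 80, 90, 70, 71, 60)
)

dt[, c("Male", "Female") := {
  is_b <- type == "benchmark"
  b    <- weight[is_b]                      # this year's benchmark weight
  list(
    # count on the benchmark row, NA everywhere else
    ifelse(is_b, sum(type == "Male"   & weight > b), NA_integer_),
    ifelse(is_b, sum(type == "Female" & weight > b), NA_integer_)
  )
}, by = year]

dt
```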

Answer 1: (score: 2)

Edit

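Here is another approach. In this version, you filter each year's rows by weight against that year's benchmark. Then count() tallies how many data points remain for males and females, and spread() reshapes those counts into wide format. You want to join this result to the rows that contain the benchmark, which is what the first right_join() does. Finally, a second right_join() merges it with the original data. At least this version avoids the verbose filter and mutate steps of the previous one. The NAs come from the right_join() calls.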

library(dplyr)
library(tidyr)

group_by(mydf, year) %>%
filter(weight > weight[which(type == "benchmark")]) %>%
count(year, type) %>%
spread(type, n) %>%
right_join(filter(mydf, type == "benchmark")) %>%
right_join(mydf)

#   year Female Male      type weight
#1  1990     NA   NA    Female     78
#2  1990     NA   NA      Male     74
#3  1990     NA   NA    Female     80
#4  1990     NA   NA      Male     90
#5  1990     NA   NA      Male     94
#6  1990     NA   NA      Male     70
#7  1990     NA   NA    Female     65
#8  1990     NA   NA    Female     61
#9  1990      1    4 benchmark     78
#10 1990     NA   NA    Female     71
#11 1990     NA   NA      Male     91
#12 1990     NA   NA    Female     70
#13 1990     NA   NA      Male     81
#14 1991     NA   NA      Male     71
#15 1991      2    1 benchmark     79
#16 1991     NA   NA    Female     80
#17 1991     NA   NA    Female     81
#18 1991     NA   NA      Male     70
#19 1991     NA   NA      Male     80
#20 1991     NA   NA    Female     65
#21 1992     NA   NA    Female     79
#22 1992      1    3 benchmark     80
#23 1992     NA   NA      Male     81
#24 1992     NA   NA      Male     82
#25 1992     NA   NA      Male     86
#26 1992     NA   NA      Male     80
#27 1992     NA   NA    Female     81

First attempt

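Here is my attempt to get the desired output. The code below is verbose, but it gives you what you want. First, group the data by year. For each year, keep the rows whose weight is greater than or equal to the benchmark. In the second filter, drop the Male and Female rows whose weight equals the benchmark; this removes the ties while keeping the benchmark row itself. Then add two columns with mutate(), one for males and one for females. table() counts how many males and females remain in each year; for example, table(type)[3] is the count for males. Once that is done, you need to add back the data points that were removed, so you join the result with the original data using right_join().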

library(dplyr)
group_by(mydf, year) %>%
filter(weight >= weight[which(type == "benchmark")]) %>%
filter(!(type %in% c("Male", "Female") & weight == weight[which(type == "benchmark")])) %>%
mutate(male = ifelse(type == "benchmark", table(type)[3], NA),
       female = ifelse(type == "benchmark", table(type)[2], NA)) %>%
right_join(mydf) %>%
ungroup


#   year      type weight male female
#1  1990    Female     78   NA     NA
#2  1990      Male     74   NA     NA
#3  1990    Female     80   NA     NA
#4  1990      Male     90   NA     NA
#5  1990      Male     94   NA     NA
#6  1990      Male     70   NA     NA
#7  1990    Female     65   NA     NA
#8  1990    Female     61   NA     NA
#9  1990 benchmark     78    4      1
#10 1990    Female     71   NA     NA
#11 1990      Male     91   NA     NA
#12 1990    Female     70   NA     NA
#13 1990      Male     81   NA     NA
#14 1991      Male     71   NA     NA
#15 1991 benchmark     79    1      2
#16 1991    Female     80   NA     NA
#17 1991    Female     81   NA     NA
#18 1991      Male     70   NA     NA
#19 1991      Male     80   NA     NA
#20 1991    Female     65   NA     NA
#21 1992    Female     79   NA     NA
#22 1992 benchmark     80    3      1
#23 1992      Male     81   NA     NA
#24 1992      Male     82   NA     NA
#25 1992      Male     86   NA     NA
#26 1992      Male     80   NA     NA
#27 1992    Female     81   NA     NA
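
A note on table(type)[3] in the first attempt: the position 3 only means "Male" because the factor levels happen to be ordered benchmark, Female, Male. The small base R sketch below (the `type` vector here is a hypothetical stand-in) shows that indexing by name gives the same count without depending on level order.

```r
# Hypothetical factor with the same level order as mydf$type
type <- factor(c("Male", "benchmark", "Female", "Male"),
               levels = c("benchmark", "Female", "Male"))

tab <- table(type)

tab[3]         # positional: the third level, which is "Male" here
tab["Male"]    # by name: same count, independent of level order
tab[["Male"]]  # bare integer, without the name attached
```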

Data

mydf <- structure(list(year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 
1991L, 1991L, 1991L, 1991L, 1991L, 1992L, 1992L, 1992L, 1992L, 
1992L, 1992L, 1992L), type = structure(c(2L, 3L, 2L, 3L, 3L, 
3L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 3L, 3L, 2L, 2L, 
1L, 3L, 3L, 3L, 3L, 2L), .Label = c("benchmark", "Female", "Male"
), class = "factor"), weight = c(78L, 74L, 80L, 90L, 94L, 70L, 
65L, 61L, 78L, 71L, 91L, 70L, 81L, 71L, 79L, 80L, 81L, 70L, 80L, 
65L, 79L, 80L, 81L, 82L, 86L, 80L, 81L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27"), .Names = c("year", "type", "weight"))

Answer 2: (score: 1)

You can do it like this (with the data in `df`):

library(data.table)

setDT(df)[ ,lapply(c('Male','Female'), function(x){
               sum(type==x & weight>weight[which(type=='benchmark')])
          }), year]

#   year V1 V2
#1: 1990  4  1
#2: 1991  1  2
#3: 1992  3  1
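
The result columns above come out as V1 and V2. If named columns are preferred, one possible variant is to spell out each count inside `.()`. This is a sketch on a hypothetical toy table `dt`, assuming one benchmark row per year; `res` is an illustrative name.

```r
library(data.table)

# Hypothetical toy data standing in for df
dt <- data.table(
  year   = c(2000, 2000, 2000, 2000, 2001, 2001, 2001),
  type   = c("Male", "Female", "benchmark", "Male", "benchmark", "Female", "Male"),
  weight = c(82, 75, 80, 90, 70, 71, 60)
)

# One row per year, with explicitly named count columns
res <- dt[, .(
  Male   = sum(type == "Male"   & weight > weight[type == "benchmark"]),
  Female = sum(type == "Female" & weight > weight[type == "benchmark"])
), by = year]

res
```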