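I want to count, in R, the number of items in subsets of an array that are greater than a specific value in that same array. See the example below: every year has an external benchmark, which is given as part of the data (it is not the mean of the data set). For each row that gives a year's benchmark, I want to add one column with the number of males whose weight is greater than the benchmark, and one column with the number of females whose weight is greater than the benchmark.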
> MyData
year type weight DesiredOutput1 DesiredOutput2
1 1990 Female 78 NA NA
2 1990 Male 74 NA NA
3 1990 Female 80 NA NA
4 1990 Male 90 NA NA
5 1990 Male 94 NA NA
6 1990 Male 70 NA NA
7 1990 Female 65 NA NA
8 1990 Female 61 NA NA
9 1990 benchmark 78 4 1
10 1990 Female 71 NA NA
11 1990 Male 91 NA NA
12 1990 Female 70 NA NA
13 1990 Male 81 NA NA
14 1991 Male 71 NA NA
15 1991 benchmark 79 1 2
16 1991 Female 80 NA NA
17 1991 Female 81 NA NA
18 1991 Male 70 NA NA
19 1991 Male 80 NA NA
20 1991 Female 65 NA NA
21 1992 Female 79 NA NA
22 1992 benchmark 80 3 1
23 1992 Male 81 NA NA
24 1992 Male 82 NA NA
25 1992 Male 86 NA NA
26 1992 Male 80 NA NA
27 1992 Female 81 NA NA
I can add a count of the number of males/females in a given year with the following code:
setDT(MyData)[, Count:=.N, by='year,type']
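But I don't know how to restrict the count to the males/females whose weight is greater than that year's benchmark. Is there a way to refer to this benchmark value? I have seen several solutions for counting values greater than a fixed number (for example, greater than 70), but how do you compare against a value inside the array?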
Answer 0 (score: 8)
I don't think you need all those NAs. If you only need the counts, you can simply tabulate by condition. Here is an example:
setDT(MyData)[, as.list(table(factor(type[weight > weight[type == 'benchmark']]))),
by = year]
# year Female Male
# 1: 1990 1 4
# 2: 1991 2 1
# 3: 1992 1 3
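For readers without data.table, the same per-year comparison can be sketched in base R. This is only an illustration, not the answer's method: the `toy` data frame and the `count_above()` helper are made up for the example, and the code assumes exactly one benchmark row per year. Fixing the factor levels keeps a zero count for a type that has no rows above the benchmark.

```r
# Hypothetical toy data in the same shape as MyData (values invented for illustration)
toy <- data.frame(
  year   = c(1990, 1990, 1990, 1990, 1991, 1991, 1991),
  type   = c("Female", "Male", "benchmark", "Male", "benchmark", "Female", "Male"),
  weight = c(78, 90, 78, 94, 79, 80, 70),
  stringsAsFactors = FALSE
)

# For one year's rows, count the types lying strictly above that year's benchmark
count_above <- function(d) {
  bench <- d$weight[d$type == "benchmark"]  # assumes exactly one benchmark row per year
  above <- d$type[d$weight > bench]         # types of the rows above the benchmark
  # Fixed levels keep a zero for a type with no rows above the benchmark
  tab <- table(factor(above, levels = c("Female", "Male")))
  setNames(as.vector(tab), names(tab))
}

# One row per year, one column per type
counts <- t(sapply(split(toy, toy$year), count_above))
print(counts)
```

Each year's rows are compared against the benchmark found in that same group, which is the role `weight[type == 'benchmark']` plays inside `by = year` above.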
Another option (probably a bit faster) is to select the cases by condition and then dcast:
dcast(setDT(MyData)[, type[weight > weight[type == 'benchmark']], by = year],
year ~ V1, length)
# year Female Male
# 1: 1990 1 4
# 2: 1991 2 1
# 3: 1992 1 3
Or similarly:
setDT(MyData)[, type[weight > weight[type == 'benchmark']], by = year
][, table(year, factor(V1))]
# year Female Male
# 1990 1 4
# 1991 2 1
# 1992 1 3
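Either way, if you insist on feeding the result back into your original data set, a quick way is to join (though this won't produce the NAs), something like (using v1.9.6+):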
res <- dcast(setDT(MyData)[, type[weight > weight[type == 'benchmark']], by = year],
year ~ V1, length)
MyData[res, c("Female", "Male") := .(i.Female, i.Male), on = "year"]
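If the NA layout from the question matters (counts on the benchmark rows only), a plain base R loop is one way to get it. This is a rough sketch under assumptions, not part of the join above: the `toy` data frame is invented for illustration, and the code assumes exactly one benchmark row per year.

```r
# Hypothetical toy data in the same shape as MyData (values invented for illustration)
toy <- data.frame(
  year   = c(1990, 1990, 1990, 1990, 1991, 1991, 1991),
  type   = c("Female", "Male", "benchmark", "Male", "benchmark", "Female", "Male"),
  weight = c(78, 90, 78, 94, 79, 80, 70),
  stringsAsFactors = FALSE
)

# Start with NA everywhere; only the benchmark rows are filled in below
toy$Male   <- NA_integer_
toy$Female <- NA_integer_

for (y in unique(toy$year)) {
  in_year  <- toy$year == y
  is_bench <- in_year & toy$type == "benchmark"
  bench    <- toy$weight[is_bench]  # assumes exactly one benchmark row per year
  toy$Male[is_bench]   <- sum(in_year & toy$type == "Male"   & toy$weight > bench)
  toy$Female[is_bench] <- sum(in_year & toy$type == "Female" & toy$weight > bench)
}

print(toy)
```

On larger data the grouped data.table or dplyr versions are likely the better choice; the loop is only meant to make the row-by-row logic explicit.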
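Answer 1 (score: 2)
EDIT
Here is another approach. In this version, you filter each year's rows by weight against that year's benchmark. Then count() counts how many data points exist for males and females. spread() reshapes the data into wide format. You want to join this data to the rows that contain the benchmarks, which the first right_join() does. Finally, right_join() is used again to merge this with the original data. This version at least avoids the verbose filtering and mutating of the previous version. Using right_join() generates the NAs.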
library(dplyr)
library(tidyr)
group_by(mydf, year) %>%
filter(weight > weight[which(type == "benchmark")]) %>%
count(year, type) %>%
spread(type, n) %>%
right_join(filter(mydf, type == "benchmark")) %>%
right_join(mydf)
# year Female Male type weight
#1 1990 NA NA Female 78
#2 1990 NA NA Male 74
#3 1990 NA NA Female 80
#4 1990 NA NA Male 90
#5 1990 NA NA Male 94
#6 1990 NA NA Male 70
#7 1990 NA NA Female 65
#8 1990 NA NA Female 61
#9 1990 1 4 benchmark 78
#10 1990 NA NA Female 71
#11 1990 NA NA Male 91
#12 1990 NA NA Female 70
#13 1990 NA NA Male 81
#14 1991 NA NA Male 71
#15 1991 2 1 benchmark 79
#16 1991 NA NA Female 80
#17 1991 NA NA Female 81
#18 1991 NA NA Male 70
#19 1991 NA NA Male 80
#20 1991 NA NA Female 65
#21 1992 NA NA Female 79
#22 1992 1 3 benchmark 80
#23 1992 NA NA Male 81
#24 1992 NA NA Male 82
#25 1992 NA NA Male 86
#26 1992 NA NA Male 80
#27 1992 NA NA Female 81
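FIRST ATTEMPT
This is my attempt to get the desired output. The following code is verbose, but it gives you what you want. First, group the data by year. For each year, select the rows whose weight is greater than or equal to the benchmark. The second filter then excludes the Male and Female rows whose weight equals the benchmark, so at that weight only the benchmark row itself is kept. Next, add two columns with mutate(), one for males and one for females. table() counts how many males and females remain in each year; for example, table(type)[3] is the count for males. Once that is done, you need to add back the data points that were removed, so you join this data with the original data using right_join().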
library(dplyr)
group_by(mydf, year) %>%
filter(weight >= weight[which(type == "benchmark")]) %>%
filter(!(type %in% c("Male", "Female") & weight == weight[which(type == "benchmark")])) %>%
mutate(male = ifelse(type == "benchmark", table(type)[3], NA),
female = ifelse(type == "benchmark", table(type)[2], NA)) %>%
right_join(mydf) %>%
ungroup
# year type weight male female
#1 1990 Female 78 NA NA
#2 1990 Male 74 NA NA
#3 1990 Female 80 NA NA
#4 1990 Male 90 NA NA
#5 1990 Male 94 NA NA
#6 1990 Male 70 NA NA
#7 1990 Female 65 NA NA
#8 1990 Female 61 NA NA
#9 1990 benchmark 78 4 1
#10 1990 Female 71 NA NA
#11 1990 Male 91 NA NA
#12 1990 Female 70 NA NA
#13 1990 Male 81 NA NA
#14 1991 Male 71 NA NA
#15 1991 benchmark 79 1 2
#16 1991 Female 80 NA NA
#17 1991 Female 81 NA NA
#18 1991 Male 70 NA NA
#19 1991 Male 80 NA NA
#20 1991 Female 65 NA NA
#21 1992 Female 79 NA NA
#22 1992 benchmark 80 3 1
#23 1992 Male 81 NA NA
#24 1992 Male 82 NA NA
#25 1992 Male 86 NA NA
#26 1992 Male 80 NA NA
#27 1992 Female 81 NA NA
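One caveat on table(type)[3]: the position depends on the order of the factor levels, which here happens to be benchmark, Female, Male. A small sketch of indexing the table by name instead, using a made-up `type` vector for illustration:

```r
# Hypothetical vector standing in for one year's filtered `type` column
type <- factor(c("Male", "benchmark", "Female", "Male"))

tab <- table(type)

# Index by level name rather than by position, so the result does not
# depend on how the factor levels happen to be ordered
n_male   <- as.vector(tab["Male"])
n_female <- as.vector(tab["Female"])

print(c(male = n_male, female = n_female))
```

The same name-based indexing could be used inside the mutate() call above in place of [3] and [2].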
DATA
mydf <- structure(list(year = c(1990L, 1990L, 1990L, 1990L, 1990L, 1990L,
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1991L, 1991L,
1991L, 1991L, 1991L, 1991L, 1991L, 1992L, 1992L, 1992L, 1992L,
1992L, 1992L, 1992L), type = structure(c(2L, 3L, 2L, 3L, 3L,
3L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 3L, 3L, 2L, 2L,
1L, 3L, 3L, 3L, 3L, 2L), .Label = c("benchmark", "Female", "Male"
), class = "factor"), weight = c(78L, 74L, 80L, 90L, 94L, 70L,
65L, 61L, 78L, 71L, 91L, 70L, 81L, 71L, 79L, 80L, 81L, 70L, 80L,
65L, 79L, 80L, 81L, 82L, 86L, 80L, 81L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27"), .Names = c("year", "type", "weight"))
Answer 2 (score: 1)
You can do it like this:
library(data.table)
setDT(df)[ ,lapply(c('Male','Female'), function(x){
sum(type==x & weight>weight[which(type=='benchmark')])
}), year]
# year V1 V2
#1: 1990 4 1
#2: 1991 1 2
#3: 1992 3 1