我在R中有以下数据框:
df <- data.frame(id=c('a','b','a','c','b','a'),
indicator1=c(1,0,0,0,1,1),
indicator2=c(0,0,0,1,0,1),
extra1=c(4,5,12,4,3,7),
extra2=c('z','z','x','y','x','x'))
id indicator1 indicator2 extra1 extra2
a 1 0 4 z
b 0 0 5 z
a 0 0 12 x
c 0 1 4 y
b 1 0 3 x
a 1 1 7 x
我想添加一个新列,其中包含特定ID出现的所有行的次数,各种指标等于1.例如:
id indicator1 indicator2 extra1 extra2 countInd1 countInd2 countInd1Ind2
a 1 0 4 z 2 1 1
b 0 0 5 z 1 0 0
a 0 0 12 x 2 1 1
c 0 1 4 y 0 1 0
b 1 0 3 x 1 0 0
a 1 1 7 x 2 1 1
我该怎么做?
答案 0 :(得分:4)
有几种方法。这是一个ave
和within
:
within(df, {
ind1ind2 <- ave(as.character(interaction(indicator1, indicator2, drop=TRUE)),
id, FUN = function(x) sum(x == "1.1"))
ind2 <- ave(indicator2, id, FUN = function(x) sum(x == 1))
ind1 <- ave(indicator1, id, FUN = function(x) sum(x == 1))
})
# id indicator1 indicator2 extra1 extra2 ind1 ind2 ind1ind2
# 1 a 1 0 4 z 2 1 1
# 2 b 0 0 5 z 1 0 0
# 3 a 0 0 12 x 2 1 1
# 4 c 0 1 4 y 0 1 0
# 5 b 1 0 3 x 1 0 0
# 6 a 1 1 7 x 2 1 1
这是另一种选择:
A <- setNames(aggregate(cbind(indicator1, indicator2) ~ id, df,
function(x) sum(x == 1)), c("id", "ind1", "ind2"))
B <- setNames(aggregate(interaction(indicator1, indicator2, drop = TRUE) ~ id,
df, function(x) sum(x == "1.1")), c("id", "ind1ind2"))
Reduce(function(x, y) merge(x, y), list(df, A, B))
# id indicator1 indicator2 extra1 extra2 ind1 ind2 ind1ind2
# 1 a 1 0 4 z 2 1 1
# 2 a 0 0 12 x 2 1 1
# 3 a 1 1 7 x 2 1 1
# 4 b 0 0 5 z 1 0 0
# 5 b 1 0 3 x 1 0 0
# 6 c 0 1 4 y 0 1 0
当然,如果您的数据很大,您可能需要浏览“data.table”包。它的输入次数也比within
版本少一点。
library(data.table)
DT <- data.table(df)
DT[, c("ind1", "ind2", "ind1ind2") :=
list(sum(indicator1 == 1),
sum(indicator2 == 1),
sum(interaction(indicator1, indicator2,
drop = TRUE) == "1.1")),
by = "id"]
DT
# id indicator1 indicator2 extra1 extra2 ind1 ind2 ind1ind2
# 1: a 1 0 4 z 2 1 1
# 2: b 0 0 5 z 1 0 0
# 3: a 0 0 12 x 2 1 1
# 4: c 0 1 4 y 0 1 0
# 5: b 1 0 3 x 1 0 0
# 6: a 1 1 7 x 2 1 1
而且,如果您觉得更明确,那么您也可以sum(interaction(...) == "1.1")
代替sum(indicator1 == 1 & indicator2 == 1)
。我没有基准测试哪个更有效率。 interaction
正是我首先想到的。
答案 1 :(得分:0)
或者你可以这样做:
get_freq1 = function(i) {sum(df[which(df$id == df[i,1]),]$indicator1)}
get_freq2 = function(i) {sum(df[which(df$id == df[i,1]),]$indicator2)}
df = data.frame(df, countInd1 = sapply(1:nrow(df), get_freq1), countInd2 = sapply(1:nrow(df), get_freq2))
df= data.frame(df, countInd1Ind2 = ((df$countInd1 != 0) & (df$countInd2 != 0))*1)
你得到:
# id indicator1 indicator2 extra1 extra2 countInd1 countInd2 countInd1Ind2
#1 a 1 0 4 z 2 1 1
#2 b 0 0 5 z 1 0 0
#3 a 0 0 12 x 2 1 1
#4 c 0 1 4 y 0 1 0
#5 b 1 0 3 x 1 0 0
#6 a 1 1 7 x 2 1 1