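This question is a modified version of "counting specific words across multiple columns in R", with the added complication of giving some columns different weights. How can I make some columns count as 1 and others as 0.5?
Reproducible example: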
df <- data.frame(id=c(1, 2, 3, 4, 5), staple_1=c("potato", "potato","rice","fruit","coffee"),
staple2_half1=c("yams","beer","potato","rice","yams"),
staple2_half2=c("potato","rice","yams","rice","yams"),
staple_3=c("rice","peanuts","fruit","fruit","rice"))
potato<-c("potato")
yams<-c("yams")
staples<-c("potato","cassava","rice","yams")
Gives:
id staple_1 staple2_half1 staple2_half2 staple_3
1 potato yams potato rice
2 potato beer rice peanuts
3 rice potato yams fruit
4 fruit rice rice fruit
5 coffee yams yams rice
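Now I want to create 2 additional columns that sum the counts of "potato" and "yams", but with the following code modified so that any match in the "half" columns (staple2_half1 and staple2_half2) counts as only 0.5 instead of 1.
Incorrect result using the original answer: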
df$staples <- apply(df, 1, function(x) sum(staples %in% x))
df$potato<- apply(df, 1, function(x) sum(potato %in% x))
df$yams<- apply(df, 1, function(x) sum(yams %in% x))
Gives:
id staple_1 staple2_half1 staple2_half2 staple_3 staples potato yams
1 potato yams potato rice 3 1 1
2 potato beer rice peanuts 2 1 0
3 rice potato yams fruit 3 1 1
4 fruit rice rice fruit 1 0 0
5 coffee yams yams rice 2 0 1
Desired result based on weighted counts:
id staple_1 staple2_half1 staple2_half2 staple_3 staples potato yams
1 potato yams potato rice 3 1.5 0.5
2 potato beer rice peanuts 1.5 1 0
3 rice potato yams fruit 2 0.5 0.5
4 fruit rice rice fruit 1 0 0
5 coffee yams yams rice 2 0 1
Answer 0 (score: 2)
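If you apply the %in% function over the columns of df[, -1], you get a matrix of TRUE and FALSE values. To take a weighted sum, you can then multiply this matrix by a vector of weights.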
words <- data.frame(staples, potato, yams)
weights <- 1 - 0.5*grepl('half', names(df[, -1]))
df[names(words)] <-
lapply(words, function(x) apply(df[, -1], 2, `%in%`, x) %*% weights)
df
# id staple_1 staple2_half1 staple2_half2 staple_3 staples potato yams
# 1 1 potato yams potato rice 3.0 1.5 0.5
# 2 2 potato beer rice peanuts 1.5 1.0 0.0
# 3 3 rice potato yams fruit 2.0 0.5 0.5
# 4 4 fruit rice rice fruit 1.0 0.0 0.0
# 5 5 coffee yams yams rice 2.0 0.0 1.0
Example of the output of apply(df[, -1], 2, ...):
apply(df[, -1], 2, `%in%`, potato)
# staple_1 staple2_half1 staple2_half2 staple_3
# [1,] TRUE FALSE TRUE FALSE
# [2,] TRUE FALSE FALSE FALSE
# [3,] FALSE TRUE FALSE FALSE
# [4,] FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE
apply(df[, -1], 2, `%in%`, potato) %*% weights
# [,1]
# [1,] 1.5
# [2,] 1.0
# [3,] 0.5
# [4,] 0.0
# [5,] 0.0
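The same idea can be wrapped in a small function that takes the weights as an explicit named vector instead of deriving them from the "half" pattern in the column names. This is only a sketch: `weighted_count` and `col_weights` are hypothetical names introduced here for illustration and are not part of the answer above.

```r
df <- data.frame(
  id            = 1:5,
  staple_1      = c("potato", "potato", "rice", "fruit", "coffee"),
  staple2_half1 = c("yams", "beer", "potato", "rice", "yams"),
  staple2_half2 = c("potato", "rice", "yams", "rice", "yams"),
  staple_3      = c("rice", "peanuts", "fruit", "fruit", "rice"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, sum the weights of the columns whose
# value is one of `words`. `weights` is a named numeric vector whose names
# are the columns to inspect.
weighted_count <- function(data, words, weights) {
  cells <- as.matrix(data[names(weights)])              # character matrix, rows x columns
  hits  <- matrix(cells %in% words, nrow = nrow(cells)) # logical matrix of the same shape
  drop(hits %*% weights)                                # weighted row sums as a plain vector
}

# Assumed weights: full columns count as 1, "half" columns as 0.5
col_weights <- c(staple_1 = 1, staple2_half1 = 0.5,
                 staple2_half2 = 0.5, staple_3 = 1)

potato_counts <- weighted_count(df, "potato", col_weights)
staple_counts <- weighted_count(df, c("potato", "cassava", "rice", "yams"), col_weights)
```

Spelling out the weights by column name avoids depending on a naming convention, at the cost of having to list every column that should be counted.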
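Answer 1 (score: 1)
There are many ways to do this, but here is one using the tidyverse. By "gathering" the data so that the staples are all in one column, I think it is easier to apply the correct weights.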
library(tidyverse)
df <- data.frame(id=c(1, 2, 3, 4, 5), staple_1=c("potato", "potato","rice","fruit","coffee"),
staple2_half1=c("yams","beer","potato","rice","yams"),
staple2_half2=c("potato","rice","yams","rice","yams"),
staple_3=c("rice","peanuts","fruit","fruit","rice"))
potato<-c("potato")
yams<-c("yams")
staples<-c("potato","cassava","rice","yams")
freqs <- df %>%
  mutate_if(is.factor, as.character) %>% # avoids a warning about converting types
  gather("column", "item", -id) %>%
  mutate(scalar = if_else(str_detect(column, "half"), 0.5, 1)) %>%
  group_by(id) %>%
  summarize(
    staples = sum(item %in% staples * scalar),
    potato = sum(item %in% potato * scalar),
    yams = sum(item %in% yams * scalar)
  )
left_join(df, freqs, by = "id")
#> id staple_1 staple2_half1 staple2_half2 staple_3 staples potato yams
#> 1 1 potato yams potato rice 3.0 1.5 0.5
#> 2 2 potato beer rice peanuts 1.5 1.0 0.0
#> 3 3 rice potato yams fruit 2.0 0.5 0.5
#> 4 4 fruit rice rice fruit 1.0 0.0 0.0
#> 5 5 coffee yams yams rice 2.0 0.0 1.0
Created on 2018-12-11 by the reprex package (v0.2.1)
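In more recent versions of tidyr, gather() is superseded by pivot_longer(). A sketch of the same reshape-then-weight approach using pivot_longer() might look like the following. It assumes dplyr and tidyr are installed and that the staple columns are character vectors (hence stringsAsFactors = FALSE). It is a variation on the answer above, not the answerer's original code.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id            = c(1, 2, 3, 4, 5),
  staple_1      = c("potato", "potato", "rice", "fruit", "coffee"),
  staple2_half1 = c("yams", "beer", "potato", "rice", "yams"),
  staple2_half2 = c("potato", "rice", "yams", "rice", "yams"),
  staple_3      = c("rice", "peanuts", "fruit", "fruit", "rice"),
  stringsAsFactors = FALSE
)
potato  <- c("potato")
yams    <- c("yams")
staples <- c("potato", "cassava", "rice", "yams")

freqs <- df %>%
  # one row per (id, column) pair, with the cell value in `item`
  pivot_longer(cols = -id, names_to = "column", values_to = "item") %>%
  # "half" columns get weight 0.5, all others weight 1
  mutate(scalar = if_else(grepl("half", column), 0.5, 1)) %>%
  group_by(id) %>%
  summarize(
    staples = sum((item %in% staples) * scalar),
    potato  = sum((item %in% potato) * scalar),
    yams    = sum((item %in% yams) * scalar)
  )

result <- left_join(df, freqs, by = "id")
```

The parentheses around each %in% test are optional, since %in% binds more tightly than *, but they make the intended order of evaluation explicit.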