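I am trying to match two sets of words against a number of strings. The two sets are car and school, and I am using the stringr package to flag every string that contains any word from either set.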
library(stringr)
car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
car_match <- str_c(car, collapse = "|")
school_match <- str_c(school, collapse = "|")
df <- data.frame(keyword=c("He drives a Honda",
"He goes to Ohio State",
"He likes Ford and goes to Ohio State"))
df
main <- function(df) {
df$car <- as.numeric(str_detect(df$keyword, car_match))
df$school <- as.numeric(str_detect(df$keyword, school_match))
df
}
main(df)
> main(df)
                               keyword car school
1                    He drives a Honda   1      0
2                He goes to Ohio State   0      1
3 He likes Ford and goes to Ohio State   1      1
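Great, that works.
Now I would like to go back and see whether I can easily count the frequency of each individual word in the car and school buckets.
So the result should look like this: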
Car     Freq
Honda      1
Chevy      0
Toyota     0
Ford       1

School      Freq
Michigan       0
Ohio State     2
Missouri       0
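Because Honda, from the car set, appears once, its frequency is 1. Likewise, Ohio State, from the school set, appears twice, so its frequency is 2.
Can anyone help me get from the category-level match to the frequency of each word within a category?
I could go back and give each word in car its own str_c pattern and match it that way, but I would like to find a simpler route.
Answer 0 (score: 2)
Maybe something like this: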
sapply(car, function(x) sum(str_count(df$keyword, x)))
#  Honda  Chevy Toyota   Ford
#      1      0      0      1
sapply(school, function(x) sum(str_count(df$keyword, x)))
#   Michigan Ohio State   Missouri
#          0          2          0
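If the result is wanted as a two-column table like the one in the question, the same sapply idea can be wrapped in a small helper. This is only a sketch: count_words is a hypothetical helper name (not part of stringr), and wrapping each word in fixed() is an assumption that the words should be matched literally rather than as regular expressions.

```r
library(stringr)

car <- c("Honda", "Chevy", "Toyota", "Ford")
school <- c("Michigan", "Ohio State", "Missouri")
df <- data.frame(keyword = c("He drives a Honda",
                             "He goes to Ohio State",
                             "He likes Ford and goes to Ohio State"))

# count_words is a hypothetical helper, not a stringr function.
# For each word, count its occurrences in every string and sum them.
count_words <- function(words, text) {
  text <- as.character(text)  # guard against factor columns
  freq <- vapply(words,
                 function(w) sum(str_count(text, fixed(w))),
                 integer(1))
  data.frame(Word = words, Freq = unname(freq), row.names = NULL)
}

car_freq <- count_words(car, df$keyword)
school_freq <- count_words(school, df$keyword)
```

Note that str_count counts every occurrence, so a word that appears twice in one string contributes 2. To count the number of strings containing the word instead, sum(str_detect(text, fixed(w))) could be used in its place.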
Answer 1 (score: 2)
You can do this with the qdap package:
library(qdap)
key <- list(
car = c("Honda", "Chevy", "Toyota", "Ford"),
school = c("Michigan", "Ohio State", "Missouri")
)
(out <- with(df, termco(keyword, keyword, key, elim.old = FALSE)))
counts(out)
##                                keyword word.count Honda Chevy Toyota Ford Michigan Ohio State Missouri car school
## 1                    He drives a Honda          4     1     0      0    0        0          0        0   1      0
## 2                He goes to Ohio State          5     0     0      0    0        0          1        0   0      1
## 3 He likes Ford and goes to Ohio State          8     0     0      0    1        0          1        0   1      1
colSums(counts(out)[, -1])
## word.count      Honda      Chevy     Toyota       Ford   Michigan Ohio State   Missouri        car     school
##         17          1          0          0          1          0          2          0          2          2