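I am new to Stack Overflow and am trying to learn R.
I want to find a defined set of words in a text and return the count of each word in a table, together with the theme I have associated it with.
Here is my attempt: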
text <- c("Green fruits are such as apples, green mangoes and avocados are good for high blood pressure. Vegetables range from greens like lettuce, spinach, Swiss chard, and mustard greens are great for heart disease. When researchers combined findings with several other long-term studies and looked at coronary heart disease and stroke separately, they found a similar protective effect for both. Green mangoes are the best.")
library(qdap)
# Own defined lists
fruit <- c("apples", "green mangoes", "avocados")
veg <- c("lettuce", "spinach", "Swiss chard", "mustard greens")
# Splitting into sentences
stext <- strsplit(text, split="\\.")[[1]]
# Obtain and count occurrences
library(plyr)
fruitres <- laply(fruit, function(x) grep(x, stext))
vegres <- laply(veg, function(x) grep(x, stext))
# Quick check: this does not return 2 results for "green mangoes"
grep("green mangoes", stext)
# Trying with the stringr package
tag_ex <- paste0('(', paste(fruit, collapse = '|'), ')')
tag_ex
library(dplyr)
library(stringr)
themes = sapply(str_extract_all(stext, tag_ex), function(x) paste(x, collapse=','))[[1]]
themes
#Create data table
library(data.table)
data.table(fruit,fruitres)
With the qdap and stringr packages I could not get the solution I want.
The ideal solution would combine fruits and vegetables in a single table:
apples          fruit  1
green mangoes   fruit  2
avocados        fruit  1
lettuce         veg    1
spinach         veg    1
Swiss chard     veg    1
mustard greens  veg    1
Any help would be greatly appreciated. Thank you.
Answer 0 (score: 1)
I tried to generalize this to N vectors.
library(tidyverse)
library(stringr)
Create a data.frame of the vectors:
data <- c("fruit","veg") # vector names
L <- map(data, ~get(.x))
names(L) <- data
long <- map_df(1:length(L), ~data.frame(category=rep(names(L)[.x]), type=L[[.x]]))
# You may receive warnings about coercing to characters
# category type
# 1 fruit apples
# 2 fruit green mangoes
# 3 fruit avocados
# etc
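As a possible alternative to looking the vectors up by name with `get()`, the same long table can be sketched in base R from a named list. The names `terms` and `long_base` are illustrative assumptions and do not appear in the code above.

```r
# Term lists from the question, stored in a named list.
# The list name `terms` is an assumption made for this sketch.
terms <- list(
  fruit = c("apples", "green mangoes", "avocados"),
  veg   = c("lettuce", "spinach", "Swiss chard", "mustard greens")
)

# One row per term: repeat each list name once per element it holds.
long_base <- data.frame(
  category = rep(names(terms), lengths(terms)),
  type     = unlist(terms, use.names = FALSE),
  stringsAsFactors = FALSE  # keep both columns as character
)

print(long_base)
```

With a named list, adding a category only means adding one more element to `terms`; no separate vector of names has to be kept in sync.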
Count the instances of each term:
long %>%
  mutate(count = str_count(tolower(text), tolower(type)))
category type count
1 fruit apples 1
2 fruit green mangoes 2
3 fruit avocados 1
4 veg lettuce 1
# etc
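For comparison, the counting step can also be sketched in base R. This illustrates why the `grep()` attempt in the question did not give 2 for "green mangoes": `grep()` returns the indices of the matching elements (here, sentences) rather than a number of occurrences, and it is case-sensitive by default, so "Green mangoes" is missed. The helper `count_term` and the shortened `sample_text` are assumptions made for this example, not part of the original code.

```r
# Shortened stand-in for the question's text (an assumption for this sketch).
sample_text <- "Green fruits such as apples, green mangoes and avocados are good. Green mangoes are the best."

# Count literal, case-insensitive occurrences of `term` in `txt`.
# gregexpr() returns -1 when there is no match, so only positive
# start positions are counted.
count_term <- function(term, txt) {
  hits <- gregexpr(tolower(term), tolower(txt), fixed = TRUE)[[1]]
  sum(hits > 0)
}

terms_to_count <- c("apples", "green mangoes", "lettuce")
counts <- vapply(terms_to_count, count_term, integer(1), txt = sample_text)
print(counts)
```

On this sample, "green mangoes" is counted twice because both the lowercase and the capitalized occurrence match after `tolower()`, and "lettuce" is counted zero times.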
We can easily add another vector:
health <- c("blood", "heart")
data <- c("fruit","veg", "health")
# code as above
Tail of the output:
6 veg Swiss chard 1
7 veg mustard greens 1
8 health blood 1
9 health heart 2
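One caveat with short terms such as "heart": counting substrings will also count longer words that contain the term. If that matters, a whole-word variant could look like the sketch below. The helper `count_whole_word` and the example sentence are assumptions made for illustration, and the pattern is not escaped, so terms containing regex metacharacters would need extra handling.

```r
# Count case-insensitive whole-word matches of `term` in `txt`.
# \\b marks a word boundary; perl = TRUE selects PCRE matching.
count_whole_word <- function(term, txt) {
  pattern <- paste0("\\b", term, "\\b")
  hits <- gregexpr(pattern, txt, ignore.case = TRUE, perl = TRUE)[[1]]
  sum(hits > 0)  # gregexpr() returns -1 when nothing matches
}

# Made-up sentence, not taken from the question's text.
example_txt <- "Heart disease and hearty meals affect the heart."

whole_word_count <- count_whole_word("heart", example_txt)
substring_count  <- sum(gregexpr("heart", tolower(example_txt), fixed = TRUE)[[1]] > 0)

print(c(whole_word = whole_word_count, substring = substring_count))
```

Here the substring count also picks up "hearty", while the whole-word count does not.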