Grouping words (from defined lists) in R

Time: 2017-08-09 17:21:31

Tags: r

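I am new to Stack Overflow and am trying to learn R.

I want to find a defined set of words in a text, then return the counts of those words in a table alongside the associated theme I have defined for each word.

Here is my attempt: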

text <- c("Green fruits are such as apples, green mangoes and avocados are good for high blood pressure. Vegetables range from greens like lettuce, spinach, Swiss chard, and mustard greens are great for heart disease. When researchers combined findings with several other long-term studies and looked at coronary heart disease and stroke separately, they found a similar protective effect for both. Green mangoes are the best.")

library(qdap)

# Own defined lists

fruit <- c("apples", "green mangoes", "avocados") 
veg <- c("lettuce", "spinach", "Swiss chard", "mustard greens")

# Split into sentences

stext <- strsplit(text, split="\\.")[[1]] 

# Obtain and count occurrences
library(plyr) 
fruitres <- laply(fruit, function(x) grep(x, stext))
vegres <- laply(veg, function(x) grep(x, stext))

# Quick check: this does not return 2 results for "green mangoes"
grep("green mangoes", stext)

# Trying with the stringr package
tag_ex <- paste0('(', paste(fruit, collapse = '|'), ')')
tag_ex

library(dplyr)
library(stringr)


themes = sapply(str_extract_all(stext, tag_ex), function(x) paste(x, collapse=','))[[1]]
themes     


# Create data table
library(data.table)
data.table(fruit,fruitres)

Using the qdap and stringr packages, I cannot get the result I want.

The ideal solution combines fruits and vegetables in one table:

apples               fruit     1
green mangoes        fruit     2
avocados             fruit     1
lettuce              veg       1
spinach              veg       1
Swiss chard          veg       1
mustard greens       veg       1

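Any help would be greatly appreciated. Thanks.

1 Answer:

Answer 0: (score: 1)

I tried to generalize this to any number (N) of vectors.

A tidyverse and stringr solution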

library(tidyverse)
library(stringr)

Create a data.frame from the vectors

data <- c("fruit","veg")   # vector names
L <- map(data, ~get(.x))
names(L) <- data
long <- map_df(1:length(L), ~data.frame(category=rep(names(L)[.x]), type=L[[.x]]))

# You may receive warnings about coercing to characters

#   category           type
# 1    fruit         apples
# 2    fruit  green mangoes
# 3    fruit       avocados
# etc
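
As a possible alternative to looking the vectors up by name with `get()`, the same long table can be sketched in base R from a named list. This is only a sketch: the name `term_lists` is introduced here for illustration and does not appear in the original answer. Converting both columns to character explicitly should also avoid the coercion warnings mentioned above.

```r
# Word lists from the question
fruit <- c("apples", "green mangoes", "avocados")
veg   <- c("lettuce", "spinach", "Swiss chard", "mustard greens")

# Hypothetical named list: the names become the category labels
term_lists <- list(fruit = fruit, veg = veg)

# stack() returns a data.frame with columns `values` and `ind`
stacked <- stack(term_lists)

# Rename and convert to plain character columns
long <- data.frame(
  category = as.character(stacked$ind),
  type     = as.character(stacked$values),
  stringsAsFactors = FALSE
)

print(long)
```

With this layout, adding a category is a matter of adding one more element to the list.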

Count the instances of each type

long %>%
  mutate(count=str_count(tolower(text), tolower(type)))

Output

  category           type count
1    fruit         apples     1
2    fruit  green mangoes     2
3    fruit       avocados     1
4      veg        lettuce     1
# etc
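
For context on why the `grep()` check in the question does not give 2 for "green mangoes": `grep()` returns the indices of the elements that match, not the number of occurrences, and it is case-sensitive by default, so a sentence starting with "Green mangoes" is not matched. The sketch below contrasts this with a count in base R. The helper `count_phrase()` and the shortened sentences are made up for illustration, and the phrase is treated as a regular expression.

```r
# Hypothetical helper: count case-insensitive occurrences of a phrase
# across all elements of a character vector
count_phrase <- function(phrase, text) {
  hits <- gregexpr(phrase, text, ignore.case = TRUE)
  # regmatches() returns one character vector of matches per element
  sum(lengths(regmatches(text, hits)))
}

# Shortened example sentences, not the full text from the question
sentences <- c("green mangoes and avocados are good",
               "Green mangoes are the best")

# Index of the matching element only; the capitalised sentence is missed
matching_index <- grep("green mangoes", sentences)

# Number of occurrences, ignoring case
n_occurrences <- count_phrase("green mangoes", sentences)

print(matching_index)
print(n_occurrences)
```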

Extras

We can easily add another vector

health <- c("blood", "heart")
data <- c("fruit","veg", "health")

# code as above

Extra output (tail)

6      veg    Swiss chard     1
7      veg mustard greens     1
8   health          blood     1
9   health          heart     2
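
Since the "code as above" step has to be repeated each time a vector is added, one option is to wrap it in a function. The following is a minimal base R sketch under stated assumptions: `count_terms()` is a hypothetical name, the sample text is a shortened stand-in for the question's text, and terms are matched literally (not as regular expressions) after lower-casing both sides, mirroring the `tolower()` approach in the answer.

```r
# Hypothetical wrapper: takes a named list of term vectors and a text,
# returns one row per term with its category and occurrence count
count_terms <- function(term_lists, text) {
  text_lower <- tolower(paste(text, collapse = " "))
  per_category <- lapply(names(term_lists), function(category) {
    terms <- term_lists[[category]]
    counts <- vapply(terms, function(term) {
      # fixed = TRUE matches the term literally; -1 means no match
      hits <- gregexpr(tolower(term), text_lower, fixed = TRUE)[[1]]
      sum(hits > 0)
    }, integer(1), USE.NAMES = FALSE)
    data.frame(category = category, type = terms, count = counts,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_category)
}

# Shortened sample text for illustration
sample_text <- "High blood pressure and heart disease. Coronary heart disease. Green mangoes and green mangoes."

result <- count_terms(
  list(fruit = c("apples", "green mangoes"), health = c("blood", "heart")),
  sample_text
)

print(result)
```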