How to count rows based on common words in R

Time: 2019-03-28 06:26:00

Tags: r

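I have a data frame in which the rows of one column are almost identical apart from a few words. I therefore want to count the rows in this column by the common words or patterns they share. Because the real data is huge, I am providing sample input.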

u=data.frame(text=c("you can find details on sunday",
                    "you may find details on sunday",
                    "you will find details on saturday",
                    "where can I get my personal details on portal",
                    "where to see personal details"),stringsAsFactors = FALSE)

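At the moment I get a count of 1 for each of these rows. However, if rows share common words, I want to merge them and get the sum of their counts.

Expected result, a data frame with 2 columns, text and count: "you can find details" should have a count of 3, and "my personal details" should have a count of 2.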

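2 answers:

Answer 0: (score: 1)

One base R solution is to use gregexpr/regmatches to extract the matches for each phrase in a vector ('str1'), count them per row with lengths, and then assign the resulting list of vectors to new columns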

# str1 is the vector of phrases defined at the end of this answer
u[c("find", "personal")] <- lapply(str1, function(x) 
             lengths(regmatches(u$text, gregexpr(x, u$text))))
u
#                                           text find personal
#1                you can find details on sunday    1        0
#2                you may find details on sunday    1        0
#3             you will find details on saturday    1        0
#4 where can I get my personal details on portal    0        1
#5                 where to see personal details    0        1

Data

str1 <- c("find details","personal details")
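
The columns above mark matches per row, whereas the question asks for one total per phrase. A minimal sketch of that last step, assuming the phrases in str1 stand in for the question's labels and that each row should count at most once per phrase; the names counts and result are illustrative and not part of the original answer:

```r
# Sample data from the question
u <- data.frame(text = c("you can find details on sunday",
                         "you may find details on sunday",
                         "you will find details on saturday",
                         "where can I get my personal details on portal",
                         "where to see personal details"),
                stringsAsFactors = FALSE)
str1 <- c("find details", "personal details")

# For each phrase, count the rows that contain it
# (fixed = TRUE treats the phrase as literal text, not as a regex)
counts <- vapply(str1, function(p) sum(grepl(p, u$text, fixed = TRUE)),
                 integer(1))

# Two-column summary: one row per phrase
result <- data.frame(text = names(counts), count = unname(counts),
                     stringsAsFactors = FALSE)
result
# counts: "find details" = 3, "personal details" = 2
```

If a phrase can occur several times in one row and every occurrence should count, summing the per-row columns with colSums would be the closer equivalent.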

Answer 1: (score: 0)

A tidyverse solution using str_count from the stringr package

library(tidyverse)

# Phrases to count in each row
str <- c("find details","personal details")

# Add one count column per phrase
u %>% 
  mutate(find = stringr::str_count(text, str[1]),
         personal = stringr::str_count(text, str[2]))

Output:

                                           text find personal
1                you can find details on sunday    1        0
2                you may find details on sunday    1        0
3             you will find details on saturday    1        0
4 where can I get my personal details on portal    0        1
5                 where to see personal details    0        1
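
Both answers assume the phrases are already known. If they are not, one possible way to find candidates is to tabulate word pairs (bigrams) and keep those that occur in more than one row. This is a rough base R sketch and not part of either answer; the helper ngrams, the choice n = 2, and the "more than one row" threshold are all assumptions made for illustration:

```r
# Sample data from the question
u <- data.frame(text = c("you can find details on sunday",
                         "you may find details on sunday",
                         "you will find details on saturday",
                         "where can I get my personal details on portal",
                         "where to see personal details"),
                stringsAsFactors = FALSE)

# Split each text into lowercase words
words <- strsplit(tolower(u$text), "\\s+")

# Hypothetical helper: all runs of n consecutive words in one row,
# deduplicated so that a row counts at most once per n-gram
ngrams <- function(w, n) {
  if (length(w) < n) return(character(0))
  unique(vapply(seq_len(length(w) - n + 1),
                function(i) paste(w[i:(i + n - 1)], collapse = " "),
                character(1)))
}

# Tabulate bigrams over all rows and keep those seen in more than one row
bigrams <- unlist(lapply(words, ngrams, n = 2))
freq <- sort(table(bigrams), decreasing = TRUE)
common <- freq[freq > 1]
common
# "details on" = 4, "find details" = 3, "on sunday" = 2, "personal details" = 2
```

The result still needs a manual pass to pick meaningful phrases, since frequent but uninformative pairs such as "details on" also appear. Raising n yields longer candidate phrases at the cost of fewer repeats.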