Question

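I have a data frame in which the rows of one column are almost identical except for a few words. I therefore want to count some common words or patterns in this column. Because the real data set is huge, I have provided a sample input.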

u=data.frame(text=c("you can find details on sunday",
                    "you may find details on sunday",
                    "you will find details on saturday",
                    "where can I get my personal details on portal",
                    "where to see personal details"),stringsAsFactors = FALSE)

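For all of these rows I currently get a count of 1. However, when rows share common words, I want to combine them so that the count is the sum over those rows.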

The expected result is a data frame with 2 columns, text and count: "you can find details" should have a count of 3, and "my personal details" should have a count of 2.

Answer 1

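One base R solution is to use gregexpr/regmatches to extract the matches for each phrase in a vector ('str1', defined under Data below), take the number of matches per row with lengths, and assign the resulting list of vectors to new columns: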

u[c("find", "personal")] <- lapply(str1, function(x) 
             lengths(regmatches(u$text, gregexpr(x, u$text))))
u
#                                           text find personal
#1                you can find details on sunday    1        0
#2                you may find details on sunday    1        0
#3             you will find details on saturday    1        0
#4 where can I get my personal details on portal    0        1
#5                 where to see personal details    0        1

Data

str1 <- c("find details","personal details")
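
The per-row columns above can be collapsed into the two-column text/count data frame the question asks for. The following is a minimal sketch under a few assumptions: the names `totals` and `result` are illustrative, `fixed = TRUE` is added so the phrases are matched literally rather than as regular expressions, and the text column holds the search phrases ("find details") rather than the longer labels shown in the question.

```r
# Sample data from the question and phrases from the answer above
u <- data.frame(text = c("you can find details on sunday",
                         "you may find details on sunday",
                         "you will find details on saturday",
                         "where can I get my personal details on portal",
                         "where to see personal details"),
                stringsAsFactors = FALSE)
str1 <- c("find details", "personal details")

# For each phrase, count its matches in every row, then sum over all rows.
# `totals` and `result` are illustrative names, not part of the original answer.
totals <- vapply(str1, function(p) {
  sum(lengths(regmatches(u$text, gregexpr(p, u$text, fixed = TRUE))))
}, integer(1))

# One row per phrase: the phrase itself and its total count
result <- data.frame(text = names(totals), count = unname(totals),
                     stringsAsFactors = FALSE)
result
```

With the sample data this should yield a count of 3 for "find details" and 2 for "personal details", matching the totals requested in the question.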

Answer 2

A tidyverse solution using str_count from the stringr package:

library(tidyverse)

str <- c("find details","personal details")

u %>% 
  mutate(find = stringr::str_count(text, str[1]),
         personal = stringr::str_count(text, str[2]))

Output:

                                              text find personal
1                you can find details on sunday    1        0
2                you may find details on sunday    1        0
3             you will find details on saturday    1        0
4 where can I get my personal details on portal    0        1
5                 where to see personal details    0        1
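
Note that str_count counts every occurrence, so a row in which a phrase appears twice contributes 2. If each row should contribute at most 1, as the question's "count of 1 per row" suggests, one option is to count matching rows with str_detect instead. This is a hedged sketch, not part of the original answer: `phrases`, `rows_with_phrase` and `summary_df` are illustrative names, and `fixed()` is used on the assumption that the phrases should be matched literally.

```r
library(stringr)

# Sample data from the question
u <- data.frame(text = c("you can find details on sunday",
                         "you may find details on sunday",
                         "you will find details on saturday",
                         "where can I get my personal details on portal",
                         "where to see personal details"),
                stringsAsFactors = FALSE)
phrases <- c("find details", "personal details")  # same phrases as `str` above

# Number of rows containing each phrase; a row counts once even if the
# phrase occurs several times within it.
rows_with_phrase <- vapply(phrases,
                           function(p) sum(str_detect(u$text, fixed(p))),
                           integer(1))

summary_df <- data.frame(text = phrases, count = unname(rows_with_phrase),
                         stringsAsFactors = FALSE)
summary_df
```

For the sample data both approaches agree, because no phrase occurs more than once in any row.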

How to take a count based on some common words in R
