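I have a data frame in which the rows of one column are nearly identical, differing only in a few words. I therefore want to extract the common words or patterns from this column. Because the real data is very large, I am providing a sample input.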
u=data.frame(text=c("you can find details on sunday",
"you may find details on sunday",
"you will find details on saturday",
"where can I get my personal details on portal",
"where to see personal details"),stringsAsFactors = FALSE)
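For all of these I currently get a count of 1. However, if rows share common words, I want to merge them and sum their count.
Expected result, as a data frame with 2 columns, text and count: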
"you can find details" - count should be 3
"my personal details" - count should be 2
Answer 0 (score: 1)
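One base R solution is to use gregexpr/regmatches to extract the matches for each pattern in a vector ('str1'), count them per row, and assign the resulting list of vectors to new columns.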
# patterns to look for; must be defined before it is used below
str1 <- c("find details", "personal details")
# for each pattern, count its matches in every row of u$text
u[c("find", "personal")] <- lapply(str1, function(x)
  lengths(regmatches(u$text, gregexpr(x, u$text))))
u
#                                           text find personal
#1                you can find details on sunday    1        0
#2                you may find details on sunday    1        0
#3             you will find details on saturday    1        0
#4 where can I get my personal details on portal    0        1
#5                 where to see personal details    0        1
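The per-row columns are one step short of the two-column text/count frame the question asks for. A minimal sketch of that last step, assuming the phrases are known in advance and should be matched literally (hence `fixed = TRUE`, which is an addition here); the names `counts` and `result` are illustrative and not part of the original answer:

```r
# sample input from the question
u <- data.frame(text = c("you can find details on sunday",
                         "you may find details on sunday",
                         "you will find details on saturday",
                         "where can I get my personal details on portal",
                         "where to see personal details"),
                stringsAsFactors = FALSE)

# phrases to count (assumed known up front)
str1 <- c("find details", "personal details")

# total number of literal matches of each phrase over all rows
counts <- vapply(str1, function(p) {
  sum(lengths(regmatches(u$text, gregexpr(p, u$text, fixed = TRUE))))
}, integer(1))

# two-column summary: one row per phrase
result <- data.frame(text = names(counts),
                     count = unname(counts),
                     stringsAsFactors = FALSE)
result
```

Note that the summary is keyed by the phrases in `str1`, not by the longer strings shown in the question ("you can find details", "my personal details"); those would only match the rows that contain them verbatim.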
Answer 1 (score: 0)
Use str_count from the stringr package (part of the tidyverse) to solve the problem:
library(tidyverse)
str <- c("find details", "personal details")
u %>%
  mutate(find = stringr::str_count(text, str[1]),
         personal = stringr::str_count(text, str[2]))
Output:
                                           text find personal
1                you can find details on sunday    1        0
2                you may find details on sunday    1        0
3             you will find details on saturday    1        0
4 where can I get my personal details on portal    0        1
5                 where to see personal details    0        1
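With more than a couple of phrases, writing one mutate argument per phrase gets tedious. As a sketch only, assuming stringr is installed and using a named vector called `patterns` (a hypothetical name, as are `per_row` and `totals`), the same counts can be built in one pass and then summed per phrase:

```r
library(stringr)

# sample input from the question
u <- data.frame(text = c("you can find details on sunday",
                         "you may find details on sunday",
                         "you will find details on saturday",
                         "where can I get my personal details on portal",
                         "where to see personal details"),
                stringsAsFactors = FALSE)

# names become the new column names, values are the phrases to count
patterns <- c(find = "find details", personal = "personal details")

# matrix with one row per text and one column per phrase;
# fixed() treats each phrase as a literal string rather than a regex
per_row <- sapply(patterns, function(p) str_count(u$text, fixed(p)))

# same layout as the output above: original text plus one count column per phrase
cbind(u, per_row)

# totals per phrase, as a text/count data frame
totals <- data.frame(text = unname(patterns),
                     count = unname(colSums(per_row)),
                     stringsAsFactors = FALSE)
totals
```

Adding a phrase then only means adding one element to `patterns`.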