Find combinations of words using R

Time: 2013-04-29 11:35:49

Tags: r text

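I am editing some text and would like to know whether I can search for certain words programmatically.

The words are almost, nearly, near, and very, when they appear next to any of these words: certain, complete, dead, entire, essential, and extinct.

Say I have this character vector: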

text <- c("R is a very essential tool for data analysis. While it is regarded as domain specific, it is a very complete programming language. Almost certainly, many people who would benefit from using R, do not use it")

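Can I get R to return a numeric vector giving the line numbers (or sentence numbers) where these words appear next to each other?

Note that I used "certainly", so ideally R should search for words that contain "certain" (or any of the other words), rather than matching only the whole word "certain".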

2 answers:

Answer 0: (score: 2)

Use grep after splitting the text at sentence boundaries with strsplit:

stext <- strsplit(text, split="\\.")[[1]]
grep("certain", stext)
[1] 3
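
The snippet above finds a single word. As a minimal sketch of searching for the word pairs themselves, one option is to build a single regular expression from two lists and pass it to grep. The vectors `modifiers` and `targets` below are illustrative names, filled in from the lists in the question; the pattern deliberately has no trailing word boundary so that "certain" also matches "certainly".

```r
text <- c("R is a very essential tool for data analysis. While it is regarded as domain specific, it is a very complete programming language. Almost certainly, many people who would benefit from using R, do not use it")

# Assumed word lists, taken from the question (names are illustrative)
modifiers <- c("almost", "nearly", "near", "very")
targets   <- c("certain", "complete", "dead", "entire", "essential", "extinct")

# A modifier as a whole word, whitespace, then a word starting with a target
pattern <- paste0("\\b(", paste(modifiers, collapse = "|"), ")\\s+(",
                  paste(targets, collapse = "|"), ")")

# Split into sentences, then find the sentences containing a pair
stext <- strsplit(text, split = "\\.")[[1]]
hits  <- grep(pattern, stext, ignore.case = TRUE, perl = TRUE)
hits
# [1] 1 2 3

# Show which pair matched first in each of those sentences
matched <- regmatches(stext, regexpr(pattern, stext, ignore.case = TRUE, perl = TRUE))
matched
# [1] "very essential" "very complete"  "Almost certain"
```

Every sentence in the sample text happens to contain a pair, so all three indices come back. The pattern only catches pairs separated by whitespace; a pair split by punctuation or another word would need a looser separator.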

Answer 1: (score: 2)

Andrie's solution will probably suit your needs better, but I am providing a second solution for future searchers who want to parse transcripts.

library(qdap)
stext <- c("R is a very essential tool for data analysis. While it is regarded 
    as domain specific, it is a very complete programming language. Almost 
    certainly, many people who would benefit from using R, do not use it.")

dat <- sentSplit(data.frame(dialogue=stext), "dialogue")
with(dat, termco(dialogue, tot, "certain"))

##   tot word.count  certain
## 1 1.1          9        0
## 2 2.2         14        0
## 3 3.3         14 1(7.14%)

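Note that punctuation matters here: I had to add the missing period at the end of the last sentence.

To get a vector of which sentences contain "certain":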

which(with(dat, termco(dialogue, tot, "certain"))$raw$certain > 0)
## [1] 3
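
If adding the final period by hand is inconvenient, a base R sketch that tolerates a missing one is to split on the whitespace that follows sentence-ending punctuation. This is not part of qdap; the variable names `clean` and `sentences` are illustrative, and the rule is naive (abbreviations such as "e.g." would also trigger a split).

```r
stext <- "R is a very essential tool for data analysis. While it is regarded
    as domain specific, it is a very complete programming language. Almost
    certainly, many people who would benefit from using R, do not use it"

# Collapse line breaks and repeated spaces into single spaces
clean <- gsub("\\s+", " ", stext)

# Split on whitespace preceded by ., ! or ? (the punctuation stays attached)
sentences <- strsplit(clean, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Indices of the sentences that contain "certain" as a substring
hit <- which(grepl("certain", sentences, fixed = TRUE))
hit
# [1] 3
```

Because the split happens between sentences and not at the terminator itself, the last sentence is kept whether or not it ends with a period.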