Question

I have a page of text, and I want to find the start and end positions of a particular word that occurs in it:

<body> I need to find the position of a **certain** word from a lot of text.</body>

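For example, here "certain" (without the **) starts at position 34 and ends at position 40. Digits and punctuation marks should be counted as well.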

How can I do this in R? The text is in XML format.

Answer 1

Use gregexpr:

x <- "I need to find the position of a certain word from a lot of certain text,
which needs a certain text processing function."
gregexpr("certain", x, fixed = TRUE)
#[[1]]
#[1] 34 61 89
#attr(,"match.length")
#[1] 7 7 7
#attr(,"useBytes")
#[1] TRUE
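The output above lists only the start positions and the match lengths. The end positions follow as start + match.length - 1. Here is a minimal sketch of that arithmetic in base R; the helper name `locate_word` is hypothetical and not part of any package.

```r
# Hypothetical helper: start and end positions of every literal match.
locate_word <- function(text, word) {
  m <- gregexpr(word, text, fixed = TRUE)[[1]]
  if (m[1] == -1L) {
    # gregexpr() reports -1 when the word does not occur at all
    return(data.frame(start = integer(0), end = integer(0)))
  }
  start <- as.integer(m)             # drops the attributes, keeps the positions
  len <- attr(m, "match.length")     # length of each match
  data.frame(start = start, end = start + len - 1L)
}

x <- "I need to find the position of a certain word from a lot of certain text, which needs a certain text processing function."
pos <- locate_word(x, "certain")
print(pos)  # starts 34, 61, 89; ends 40, 67, 95
```

Spaces, digits, and punctuation all count as characters here, which is what the question asks for.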

Answer 2

The stringi package has very useful functions for this:

x <- "I need to find the position of a certain word from a lot of certain text, which needs a certain text processing function."

> stringi::stri_locate_all_regex(str = x, "certain") # list of start and end locations for matches
[[1]]
     start end
[1,]    34  40
[2,]    61  67
[3,]    89  95
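Because the pattern is treated as a regular expression, a plain "certain" would also match inside a longer word such as "uncertain". If only whole words should count, word boundaries can be added to the pattern. The sketch below shows the idea with base R's gregexpr and perl = TRUE so that it runs without stringi installed; the sample sentence `x2` is made up for illustration.

```r
# Made-up sample text containing both "uncertain" and "certain"
x2 <- "An uncertain result is certain to appear."

# Plain pattern: also matches the "certain" inside "uncertain"
loose <- as.integer(gregexpr("certain", x2)[[1]])

# \\b marks a word boundary, so only the standalone word matches
strict_match <- gregexpr("\\bcertain\\b", x2, perl = TRUE)[[1]]
strict_start <- as.integer(strict_match)
strict_end <- strict_start + attr(strict_match, "match.length") - 1L

loose                                     # 6 24
c(start = strict_start, end = strict_end) # 24 30
```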

Answer 3

You can use the cwhmisc package. You should first put your text into a character vector:

library(cwhmisc)

A <- "I need to find the position of a certain word from a lot of text"

# cpos() gives the position of the substring within A
cpos(A, "certain")
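The text in the question is wrapped in markup, so it has to be cleaned before it goes into the vector. Otherwise the tags and the ** markers shift every position. The following is a rough sketch that strips them with base R's gsub; it is a simple pattern-based cleanup, not a real XML parser, and it uses base regexpr for the lookup so that it does not depend on cwhmisc.

```r
raw <- "<body> I need to find the position of a **certain** word from a lot of text.</body>"

no_tags <- gsub("<[^>]+>", "", raw)                 # drop anything that looks like a tag
no_marks <- gsub("**", "", no_tags, fixed = TRUE)   # drop the ** emphasis markers
A <- trimws(no_marks)                               # remove the leading/trailing space

m <- regexpr("certain", A, fixed = TRUE)
start <- as.integer(m)
end <- start + attr(m, "match.length") - 1L
c(start = start, end = end)  # 34 40, as stated in the question
```

For real XML documents a dedicated parser would be more robust than a regular expression.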

Answer 4

You can use this:

> regexpr("a","sjnasd")
[1] 4
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE

Note, however, that this only finds the first occurrence of the substring within the larger string.
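To make the difference concrete, here is a small sketch comparing regexpr (first match only) with gregexpr (every match). The string "banana" is just an illustrative input.

```r
s <- "banana"

first <- regexpr("a", s, fixed = TRUE)              # first occurrence only
all_matches <- gregexpr("a", s, fixed = TRUE)[[1]]  # every occurrence

as.integer(first)        # 2
as.integer(all_matches)  # 2 4 6
```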

Answer 5

You can also use stringr's str_locate function - note that it is just a wrapper around base::regexpr, but the name is more memorable :-)
