Question

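I'm trying to write a regex in R that extracts the words between the numbers for each string in a character vector. Unfortunately, my regex skills aren't up to the challenge yet.
Here is an example of the problem, along with my initial attempt: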

x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323", 
      "3 Anotherword wordagain Newword. 3,234 556")

m <- regexpr("[a-zA-Z]+\\s+", x, perl = TRUE)

regmatches(x, m)

This approach only yields

"Singleword ", "randword ", "Anotherword "

What I need is

"Singleword", "randword & thirdword", "Anotherword wordagain Newword."

I believe this needs a regex pattern that starts at a letter (as my current one does) and then pulls in everything up to the next digit.

Answer 1

x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323", 
       "3 Anotherword wordagain Newword. 3,234 556")

m <- regexpr("[a-zA-Z].(\\D)+", x, perl = TRUE)

regmatches(x, m)

[1] "Singleword "  "randword & thirdword "  "Anotherword wordagain Newword. "

I used https://regexr.com/ and its cheat sheet to work out how to build the regular expression.
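
Note that each match above still carries a trailing space, because \D also matches the space in front of the next number. A minimal sketch, assuming it is acceptable to strip surrounding whitespace after matching, wraps the result in base R's trimws():

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# Same pattern as above: a letter, any one character, then one or more non-digits
m <- regexpr("[a-zA-Z].(\\D)+", x, perl = TRUE)

# trimws() drops the trailing space that \\D+ picks up before the next number
result <- trimws(regmatches(x, m))
print(result)
```

Keep in mind that regmatches() silently drops elements with no match, so the result can be shorter than x if some strings contain no letters.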

Answer 2

Using sub

> sub(".\\s(\\D+).*", "\\1", x)
[1] "Singleword "   "randword & thirdword "  "Anotherword wordagain Newword. "

Using str_extract

> library(stringr)
> str_extract(x, pattern = "\\D+")
[1] " Singleword "  " randword & thirdword "  " Anotherword wordagain Newword. "
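
Both results above keep the spaces around the words. If that matters, one possible variant (a sketch, not the only way) moves the whitespace outside the capture group and uses a lazy quantifier, which needs perl = TRUE. It assumes every string starts with a number and that the wanted text is followed by whitespace and a digit:

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# ^\\d+\\s+   the leading number and the whitespace after it
# (.*?)       lazily capture as little as possible ...
# \\s+\\d.*$  ... up to the whitespace before the next digit, then the rest
words <- sub("^\\d+\\s+(.*?)\\s+\\d.*$", "\\1", x, perl = TRUE)
print(words)
```

Unlike regmatches(), sub() returns a string unchanged when the pattern does not match, so the result always has the same length as x.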

Answer 3

Sample data

x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323", 
   "3 Anotherword wordagain Newword. 3,234 556")

Base R

# replace all digits and commas with "" (nothing),
# then trim the surrounding whitespace (thanks Markus!)
trimws( gsub( "[0-9,]", "", x ) )

[1] "Singleword"  "randword & thirdword"  "Anotherword wordagain Newword."

stringr

library(stringr)
str_extract(x, pattern = "(?<=\\d )[^0-9]+(?= \\d)")

[1] "Singleword"  "randword & thirdword"  "Anotherword wordagain Newword."
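
The same lookaround pattern should also work without stringr. As a sketch, base R's regexpr() accepts it when perl = TRUE is set (to my knowledge, the default regex engine does not support lookbehind or lookahead):

```r
x <- c("1 Singleword 1,234 342", "2 randword & thirdword 1,545 323",
       "3 Anotherword wordagain Newword. 3,234 556")

# (?<=\\d )  preceded by a digit and a space (not part of the match)
# [^0-9]+    one or more non-digit characters
# (?= \\d)   followed by a space and a digit (not part of the match)
m <- regexpr("(?<=\\d )[^0-9]+(?= \\d)", x, perl = TRUE)
between <- regmatches(x, m)
print(between)
```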

If you want to learn more about how the regex patterns in the code above (and in the other answers) work, visit https://regex101.com/

Explanation of the last regex: https://regex101.com/r/QgERuZ/2
