Question

My text contains quotations, some of which include punctuation and special characters such as arrows. For example:

 quotes <- c("He was thinking “my go::d I can't get out here”. So he goes “↑beep beep↑” on the horn, this bloke went “HUh HUh,”")

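I would like to extract the quotations with a regular expression. So far I have been looking at the package stringr; str_subset() in particular might be relevant, but I am inexperienced with regular expressions. Any help?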

Answer 1

You can do this with the regular expression functions from the base package:

quotes <- c("He was thinking “my go::d I can't get out here”. So he goes “↑beep beep↑” on the horn, this bloke went “HUh HUh,”")

pattern <- "“[^”]*”"
matches <- gregexpr(pattern, quotes)
regmatches(quotes, matches)
## [[1]]
## [1] "“my go::d I can't get out here”" "“↑beep beep↑”"
## [3] "“HUh HUh,”"

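The function gregexpr() finds all occurrences of the pattern within quotes. The function regmatches() can then be used to extract the text that was actually matched.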
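
To make the two steps more concrete, here is a minimal sketch that looks at what gregexpr() returns before regmatches() is applied. The input string x is a shortened, made-up variant of the example above, and the curly quotes are written as the escapes \u201c and \u201d only so that the snippet does not depend on the file encoding.

```r
# Illustrative input (an assumption, shortened from the question's example);
# \u201c and \u201d are the opening and closing curly quotes
x <- "So he goes \u201cbeep beep\u201d on the horn, this bloke went \u201cHUh HUh,\u201d"

pattern <- "\u201c[^\u201d]*\u201d"
m <- gregexpr(pattern, x)

# gregexpr() returns a list with one element per input string; each element
# holds the match start positions, with the match lengths stored in the
# "match.length" attribute
starts     <- as.integer(m[[1]])
match_lens <- attr(m[[1]], "match.length")
print(starts)
print(match_lens)

# regmatches() uses those positions to cut the matched substrings out of x;
# [[1]] selects the matches belonging to the first (and only) input string
found <- regmatches(x, m)[[1]]
print(found)
```

Because regmatches() returns a list, indexing with [[1]] (or calling unlist()) is a convenient way to get a plain character vector when there is only one input string.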

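The pattern matches an opening quote, a closing quote, and any number of characters in between, except closing quotes. Closing quotes are excluded with [^”], which matches any character other than ”.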

Two further remarks:

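You cannot use the pattern "“.*”", because matching is greedy. That pattern would match everything from the first opening quote to the last closing quote.
You can also express the pattern with Unicode code points: pattern <- "\u201c[^\u201d]*\u201d"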
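
The point about greedy matching can be checked with a small sketch along these lines. The input string x is a made-up, shortened variant of the question's example, and the names greedy and negated are chosen here for illustration only.

```r
# Illustrative input (an assumption, shortened from the question's example)
x <- "he goes \u201cbeep beep\u201d on the horn, this bloke went \u201cHUh HUh,\u201d"

# \u201c and \u201d are the code points of the opening and closing curly quotes
greedy  <- "\u201c.*\u201d"          # .* runs on to the last closing quote
negated <- "\u201c[^\u201d]*\u201d"  # [^\u201d]* cannot cross a closing quote

greedy_hits  <- regmatches(x, gregexpr(greedy, x))[[1]]
negated_hits <- regmatches(x, gregexpr(negated, x))[[1]]

# Number of matches found by each pattern
print(length(greedy_hits))
print(length(negated_hits))
print(negated_hits)
```

With the greedy pattern the two quotations end up merged into one long match, whereas the negated character class returns each quotation separately.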
