Question

I have been struggling to come up with a way to extract text from a PDF document I am working with.

The text looks like this:

"* text text text\n text text text.\n      * text text text text text text.\n"

I am trying to get the following as separate values in a vector:

"* text text text\n text text text." 
"* text text text text text text."

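I can't split on \n, because a bullet can itself contain a \n, and when I try to match from a bullet up to the next bullet, it fails. As I understand it, I need to restrict the match to the range between two bullets and end it at the last \n, but I don't know how to do that.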

This is my current regular expression:

"\\* (.)*\n"

Answer 1

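You can use strsplit:

string <- "* text text text\n text text text.\n      * text text text text text text.\n"

# Split on a newline that is followed by two or more whitespace characters, or by the end of the string
unlist(strsplit(string, "\n(\\s{2,}|$)"))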
# [1] "* text text text\n text text text." "* text text text text text text."
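
To see why this pattern splits where it does, it can help to look at what the delimiter actually matches. The following is a minimal sketch in base R; the names delims and pieces are illustrative only and not part of the original answer.

```r
string <- "* text text text\n text text text.\n      * text text text text text text.\n"

# Collect every substring that the split pattern treats as a delimiter
delims <- regmatches(string, gregexpr("\n(\\s{2,}|$)", string))[[1]]
print(delims)

# The newline inside the first bullet is followed by a single space,
# so it is not a delimiter and stays inside the first piece
pieces <- unlist(strsplit(string, "\n(\\s{2,}|$)"))
print(pieces)
```

Note that this relies on continuation lines being indented by fewer than two whitespace characters, as in the sample text; input with a different layout would probably need a different delimiter.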

Another option is to use str_extract_all from stringr, wrapping the pattern in regex() and enabling the dotall option:

library(stringr)

unlist(str_extract_all(string, regex("\\*.+?\\.", dotall = TRUE)))
# [1] "* text text text\n text text text." "* text text text text text text."

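Notes:

With dotall = TRUE, . also matches \n.

The ? in .+? makes the match lazy (non-greedy), so it stops at the first . that follows each bullet.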
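
The effect of the dotall option and of lazy matching can be checked without stringr. The sketch below uses base R with perl = TRUE, where the inline flag (?s) plays the role of dotall; the helper extract() is a hypothetical name introduced here for illustration, not something from the original answer.

```r
string <- "* text text text\n text text text.\n      * text text text text text text.\n"

# Hypothetical helper: return all matches of a PCRE pattern in `string`
extract <- function(pattern) {
  regmatches(string, gregexpr(pattern, string, perl = TRUE))[[1]]
}

# Lazy quantifier with (?s): each match stops at the first "." after a bullet
lazy_dotall <- extract("(?s)\\*.+?\\.")

# Greedy quantifier with (?s): one match running from the first "*" to the last "."
greedy_dotall <- extract("(?s)\\*.+\\.")

# Lazy quantifier without (?s): "." cannot cross "\n", so the first bullet is missed
lazy_no_dotall <- extract("\\*.+?\\.")

print(lazy_dotall)
print(greedy_dotall)
print(lazy_no_dotall)
```

Comparing the three results shows that both pieces are needed here: dotall lets a bullet span a line break, and the lazy quantifier keeps the two bullets from being merged into a single match.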
