Question

My text is as follows:

".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"

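I want to replace any full stop followed by a letter (upper or lower case) with a full stop, a newline, and then the letter. So the output should be: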

".\nOESOPHAGUS: inflammation. .\nSTOMACH: Lots of information here. .\nDUODENUM: Some more information. .\nENDOSCOPIC DIAGNOSIS blabla"

I tried:

gsub("\\..*?([A-Za-z])","\\.\n\\1",MyData$Algo)

But this gives me:

".\nESOPHAGUS: inflammation.\nTOMACH: Lots of information here.DUODENUM: Some more information.\nNDOSCOPIC DIAGNOSIS blabla"

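The problem seems to be in how the range between the full stop and the letter is matched. Is there a way to do this find-and-replace? I'm not tied to gsub.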

Answer 1

Perl-compatible regular expressions (PCRE) should work fine for this example.

a <- ".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"

# perl = TRUE switches gsub to the PCRE engine
gsub("\\..*?([A-Za-z])", "\\.\n\\1", a, perl = TRUE)
# output:
# [1] ".\nOESOPHAGUS: inflammation.\nSTOMACH: Lots of information here.\nDUODENUM: Some more information.\nENDOSCOPIC DIAGNOSIS blabla"

I'm not sure why lazy matching behaves differently when perl = FALSE.
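
Since PCRE is already switched on here, a lookahead is another option worth considering. The following is only a sketch, not part of the original answer: it matches the full stop and any whitespace after it, and uses `(?=[A-Za-z])` to require a following letter without consuming it, so no capture group or backreference is needed. The variable name `res` is introduced purely for illustration.

```r
# Same input string as in the answer above.
a <- ".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"

# "\\." matches a literal full stop, "\\s*" any whitespace after it.
# "(?=[A-Za-z])" is a lookahead: a letter must follow, but it is not
# part of the match, so it is left untouched by the replacement.
# Lookaheads are a PCRE feature, hence perl = TRUE.
res <- gsub("\\.\\s*(?=[A-Za-z])", ".\n", a, perl = TRUE)

# cat() prints the string with the newlines rendered.
cat(res, "\n")
```

Because the whitespace is matched explicitly rather than through a lazy `.*?`, this variant does not rely on how lazy quantifiers are handled.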

Answer 2

I'm not sure why you would want `. .` rather than just `.\n`, but this works for the latter:

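# str holds the text from the question
str <- ".OESOPHAGUS: inflammation. STOMACH: Lots of information here.DUODENUM: Some more information. ENDOSCOPIC DIAGNOSIS blabla"
gsub('[.]\\s*([a-zA-Z])', '.\n\\1', str)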
# [1] ".\nOESOPHAGUS: inflammation.\nSTOMACH: Lots of information here.\nDUODENUM: Some more information.\nENDOSCOPIC DIAGNOSIS blabla"

When printed to the console with cat, it looks like this:

cat(gsub('[.]\\s*([a-zA-Z])', '.\n\\1', str))
# .
# OESOPHAGUS: inflammation.
# STOMACH: Lots of information here.
# DUODENUM: Some more information.
# ENDOSCOPIC DIAGNOSIS blabla

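I can't explain why .*? doesn't do what you want. But there is no reason to use . in this case, because you do have a restriction on which characters may appear between the full stop and the letter (I assume whitespace, \s, is enough).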
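
To apply the same pattern to a whole column, as in the question's `MyData$Algo`, it could be wrapped in a small function. This is a minimal sketch under stated assumptions: the helper name `break_after_full_stop` and the sample data frame are made up for illustration and are not from the original post.

```r
# Hypothetical helper (not from the original answers): insert a newline
# after every full stop that is followed, after optional whitespace,
# by a letter. gsub() is vectorised, so x may be a character vector.
break_after_full_stop <- function(x) {
  gsub("[.]\\s*([a-zA-Z])", ".\n\\1", x)
}

# Made-up data standing in for the asker's MyData.
MyData <- data.frame(
  Algo = c(".OESOPHAGUS: inflammation. STOMACH: normal.DUODENUM: normal",
           "No full stop here"),
  stringsAsFactors = FALSE
)
MyData$Algo <- break_after_full_stop(MyData$Algo)

# strsplit() turns the newline-separated text into one element per section.
sections <- strsplit(MyData$Algo[1], "\n", fixed = TRUE)[[1]]
print(sections)
```

Note that the pattern treats every full stop followed by a letter as a section break, so abbreviations such as "e.g." inside a section would be split as well.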

How to find and replace a range using gsub
