Question

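I am working with a lot of old text material. The OCR process often inserts a "." in the middle of words, for example "t.h.i.s i.s a test". I want to replace these dots with nothing (""), but I do not want to lose the dots that mark the end of a sentence. So I am looking for a regular expression that finds letter/dot/letter and then replaces the dot with nothing.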

    test <- "t.h.i.s i.s a test." 
    gsub(test, pattern="\\w[[:punct:]]\\w", replacement="")

But this is the result:

    ".  a test."

Any suggestions are appreciated.

Answer 1

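You can do the opposite, i.e. extract everything in the sentence that is not a dot in the middle of the string:

    require(stringr)
    test <- "t.h.i.s i.s a test."
    # keep every non-dot character, plus a dot at the very end of the string
    paste0(str_extract_all(test, "[^\\.]|(\\.$)")[[1]], collapse = "")

    [1] "this is a test."

If you want to allow for several sentences, we can assume that a dot followed by a space is also allowed; the pattern can then be extended with an additional alternative that matches a dot followed by a space.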
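For the multi-sentence case, here is a minimal sketch that adds a third alternative, a dot followed by a space, to the pattern used in this answer. The two-sentence input and the variable names `test2`, `pieces` and `result` are made up for illustration; the exact pattern is an assumption, not necessarily the one the answer had in mind.

```r
library(stringr)

# Made-up two-sentence input for illustration
test2 <- "t.h.i.s i.s a test. W.i.t.h another sentence."

# Keep: any non-dot character, a dot at the end of the string,
# or a dot followed by a space (assumed to end a sentence)
pieces <- str_extract_all(test2, "[^.]|\\.$|\\. ")[[1]]
result <- paste0(pieces, collapse = "")
print(result)
```

Note that under this assumption a stray OCR dot that happens to sit directly before a space is kept as well.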

Answer 2

Here is my best guess, along with suggestions on how to enhance the pattern further:

    > test = "T.h.i.s is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."
    > gsub("\\b((?:U[.]S|etc|M(?:r?s|r))[.]|\\d+[.]\\d+)|[.](?!$|\\s+\\p{Lu})", "\\1", test, perl=TRUE)
    [1] "This is a U.S. state. I drove 5.5 miles. Mr. Smith know English, French, etc. and can drive a car."

See the regex demo.

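Explanation:

\b((?:U[.]S|etc|M(?:r?s|r))[.]|\d+[.]\d+) - matches the exceptions, which are restored in the replacement with the \1 backreference. This part matches U.S., etc., Mr., Ms., Mrs. and digits+.digits, and it can be extended.
| - or
[.](?!$|\s+\p{Lu}) - matches a dot that is not followed by the end of the string ($) or by 1+ whitespace characters followed by an uppercase letter (\s+\p{Lu}). Such dots are removed, because \1 is empty when this branch matches.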
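One way to make the exception list easier to extend is to assemble the pattern from a character vector. The sketch below assumes a hypothetical helper `build_pattern()` and a made-up abbreviation list and input; it only rearranges the pattern explained above and is not part of the original answer.

```r
# Made-up abbreviation list; each entry is a regex fragment without the final dot
abbrevs <- c("U[.]S", "etc", "Mrs?", "Ms", "approx")

# Hypothetical helper: build the "keep exceptions | drop inner dots" pattern
build_pattern <- function(abbrevs) {
  # Group 1: abbreviations plus their dot, or decimal numbers (restored via \\1)
  keep <- paste0("\\b((?:", paste(abbrevs, collapse = "|"), ")[.]|\\d+[.]\\d+)")
  # A dot not followed by end of string or by whitespace + uppercase letter
  drop <- "[.](?!$|\\s+\\p{Lu})"
  paste0(keep, "|", drop)
}

test <- "T.h.i.s road is approx. 5.5 miles long. Mr. Smith drove it."
result <- gsub(build_pattern(abbrevs), "\\1", test, perl = TRUE)
print(result)
```

Here "approx." keeps its dot only because it is listed as an exception; without that entry the dot would be removed, since it is followed by a space and a digit rather than an uppercase letter.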

Answer 3

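    # remove every dot, then put a single dot back at the end
    paste0(gsub('\\.', '', test), '.')
    #[1] "this is a test."

To make this work with more than one sentence (at the cost of uglier code),

    # assumed input, inferred from the output below
    test <- "t.h.i.s i.s a test. With another sentence."
    # split on dot + space, remove the remaining dots, restore one dot per sentence
    paste(paste0(gsub('\\.', '', unlist(strsplit(test, '\\. '))), '.'), collapse = ' ')
    #[1] "this is a test. With another sentence."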
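The multi-sentence one-liner can be easier to follow when it is unpacked into steps. The sketch below wraps the same logic in a function; the name `strip_inner_dots` and the sample input are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: same logic as the one-liner, one step per line
strip_inner_dots <- function(x) {
  sentences <- unlist(strsplit(x, "\\. "))     # split on dot + space
  cleaned <- gsub("\\.", "", sentences)        # drop every remaining dot
  paste(paste0(cleaned, "."), collapse = " ")  # restore one dot per sentence
}

result <- strip_inner_dots("t.h.i.s i.s a test. W.i.t.h another sentence.")
print(result)
```

As with the one-liner, this always appends a dot to each piece, so input that does not end with a dot gains one.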

R: How to replace a dot between two characters in a string
