Question

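I am working in R and trying to clean up files by removing line breaks that fall in inconvenient places, i.e. any such whitespace between a pair of tags: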

<sometext> ... \n .. </sometext>

For example,

<TEXT>Purchased this as a cert pre owned for a great price. \n
Had only 10000 miles on it and jumped on it.</TEXT>

should become

<TEXT>Purchased this as a cert pre owned for a great price.Had only 10000 miles on it and jumped on it.</TEXT>

I was thinking of using an expression such as

(<[A-z]+>)(.+)(\n)(.+)(<\/[A-z]+>)

and then removing whatever matches in group 3, but there must be a "smarter" way to do it.

Answer 1

I think you may be overcomplicating this, unless I am misunderstanding something:

string <- "<TEXT>Purchased this as a cert pre owned for a great price.

Had only 10000 miles on it and jumped on it.</TEXT>"

string

[1] "<TEXT>Purchased this as a cert pre owned for a great price.\n\nHad only 10000 miles on it and jumped on it.</TEXT>"

gsub("\n"," ", string)


[1] "<TEXT>Purchased this as a cert pre owned for a great price.  Had only 10000 miles on it and jumped on it.</TEXT>"

Update: based on your comment, you only want to do this between a pair of tags. We can do that quite easily with the gsubfn package:

string <- "Don't delete this newline 


<TEXT>Purchased this as a cert pre owned for a great price.
Had only 10000 miles on it and jumped on it.</TEXT>"

string

gsub("\n"," ", string)


library(gsubfn)
gsub("\n", " ", strapplyc(string, ">(.*?)</", simplify = c))

This gives:

[1] "Purchased this as a cert pre owned for a great price. Had only 10000 miles on it and jumped on it."

Text that is not between tags is not matched, so it is not affected by gsub.

Depending on what you need, you may also want something like this:

sub("<(.*?)>(.*?)</(.*?)>",gsub("\n", " ", strapplyc(string, ">(.*?)</", simplify = c)),string)

[1] "Don't delete this newline\n Purchased this as a cert pre owned for a great price. Had only 10000 miles on it and jumped on it."
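If you would rather keep the tags and avoid an extra package, one possible base R alternative is a lookahead. This is only a sketch, and it assumes that the text inside the tags never contains a "<" character: a newline is replaced only when nothing but non-"<" characters separate it from the next </TEXT>.

```r
string <- "Don't delete this newline\n\n<TEXT>Purchased this as a cert pre owned for a great price.\nHad only 10000 miles on it and jumped on it.</TEXT>"

# Replace a newline only if it is followed by non-"<" characters and then
# the closing tag, i.e. only if it sits inside <TEXT>...</TEXT>.
# Lookaheads require perl = TRUE.
res <- gsub("\n(?=[^<]*</TEXT>)", " ", string, perl = TRUE)
res
```

The two newlines before <TEXT> are left alone because the first "<" after them opens a tag rather than closing one.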

Answer 2

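It seems you only want to remove a single line break, or a single run of consecutive line breaks, between two tags, with optional whitespace around the line breaks.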

You can use a PCRE regex replacement:

(<TEXT>.*?)\h*\R+\h*(.*?</TEXT>)

See the regex demo and the R demo.

Details

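(<TEXT>.*?) - Group 1: <TEXT>, then any 0+ characters other than line break characters (since . in a PCRE regex does not match line breaks by default), as few as possible, up to the first occurrence of the subsequent subpatterns
\h* - 0+ horizontal whitespace characters (matched greedily)
\R+ - one or more line break sequences (CR, LF or CRLF)
\h* - 0+ horizontal whitespace characters (matched greedily)
(.*?</TEXT>) - Group 2: any 0+ characters other than line break characters, as few as possible, followed by the </TEXT> string.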

In the replacement, \1 inserts the Group 1 value back into the result, and \2 does the same with the Group 2 value.
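Written out as an R call, the replacement might look like the sketch below. The variable names x and res are only for illustration, and the \1\2 replacement string is inferred from the description above. Backslashes are doubled inside the R string literal, and perl = TRUE is needed for \h and \R.

```r
x <- "<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it.</TEXT>"

# Group 1 = everything up to the whitespace before the line break,
# Group 2 = everything after it up to and including </TEXT>.
# Only the groups are put back, so the break and its padding are dropped.
res <- gsub("(<TEXT>.*?)\\h*\\R+\\h*(.*?</TEXT>)", "\\1\\2", x, perl = TRUE)
res
```

Because . does not cross line breaks here, this handles only one run of line breaks per tag pair.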

If you want to replace multiple occurrences between the two strings, use Hack-R's gsubfn approach:

library(gsubfn)
x2 <- "<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it. \r\nAnd another sentence.</TEXT>"
gsubfn("(<TEXT>)(.*?)(</TEXT>)", function(g1,g2,g3) paste0(g1,gsub("\\h*\\R+\\h*", "", g2, perl=TRUE),g3), x2)

[1] "<TEXT>Purchased this as a cert pre owned for a great price.Had only 10000 miles on it and jumped on it.And another sentence.</TEXT>"

The (<TEXT>)(.*?)(</TEXT>) pattern matches and captures <TEXT> into Group 1, then captures any 0+ characters, as few as possible (lazy matching), into Group 2, and then captures </TEXT> into Group 3. Then, in the callback passed to gsubfn, you can remove every occurrence of <spaces>*<line_break(s)><spaces>* with gsub("\\h*\\R+\\h*", "", g2, perl=TRUE).
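The same idea can probably be expressed in base R for arbitrary tag names, as in the question's <sometext> example. The sketch below is not from either answer: strip_breaks_in_tags is a hypothetical helper name, and it assumes simple, non-nested tags made of letters only. It uses [A-Za-z] rather than [A-z], because the A-z range also covers the characters [ \ ] ^ _ and the backtick.

```r
# Hypothetical helper: remove line breaks (plus surrounding horizontal
# whitespace) inside any <tag>...</tag> pair, leaving other text untouched.
strip_breaks_in_tags <- function(x) {
  # (?s) lets . cross line breaks; \\1 forces the closing tag name
  # to equal the opening tag name.
  m <- gregexpr("(?s)<([A-Za-z]+)>.*?</\\1>", x, perl = TRUE)
  # Clean each matched span, then write the spans back in place.
  regmatches(x, m) <- lapply(regmatches(x, m), function(span) {
    gsub("\\h*\\R+\\h*", "", span, perl = TRUE)
  })
  x
}

x2 <- "keep this break\n<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it. \r\nAnd another sentence.</TEXT>"
res <- strip_breaks_in_tags(x2)
res
```

The line break after "keep this break" survives because it lies outside any tag pair.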
