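How to use regular expressions

Time: 2017-09-04 14:55:06

Tags: r regex

I'm working in R and trying to clean up a file by removing line breaks in inconvenient places, i.e. all whitespace between tags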

<sometext> ... \n .. </sometext>

For example

<TEXT>Purchased this as a cert pre owned for a great price. \n
Had only 10000 miles on it and jumped on it.</TEXT>

becomes

<TEXT>Purchased this as a cert pre owned for a great price.Had only 10000 miles on it and jumped on it.</TEXT>

I was thinking of using an expression such as

(<[A-z]+>)(.+)(\n)(.+)(<\/[A-z]+>)

and then deleting whatever matches in group 3, but there must be a "smarter" way to do it.

2 answers:

Answer 0 (score: 4)

I think you may be overcomplicating this, unless I've misunderstood something:

string <- "<TEXT>Purchased this as a cert pre owned for a great price.

Had only 10000 miles on it and jumped on it.</TEXT>"

string

[1] "<TEXT>Purchased this as a cert pre owned for a great price.\n\nHad only 10000 miles on it and jumped on it.</TEXT>"

gsub("\n"," ", string)


[1] "<TEXT>Purchased this as a cert pre owned for a great price.  Had only 10000 miles on it and jumped on it.</TEXT>"

Update: based on your comment, you only want to do this between pairs of tags. We can do that quite easily with the gsubfn package:

string <- "Don't delete this newline 


<TEXT>Purchased this as a cert pre owned for a great price.
Had only 10000 miles on it and jumped on it.</TEXT>"

string

gsub("\n"," ", string)


library(gsubfn)
gsub("\n", " ", strapplyc(string, ">(.*?)</", simplify = c))

This results in:

[1] "Purchased this as a cert pre owned for a great price. Had only 10000 miles on it and jumped on it."

Text that is not between tags is not matched, so it is not affected by gsub.

Depending on your needs, you may also want something like this:

sub("<(.*?)>(.*?)</(.*?)>",gsub("\n", " ", strapplyc(string, ">(.*?)</", simplify = c)),string)

[1] "Don't delete this newline\n Purchased this as a cert pre owned for a great price. Had only 10000 miles on it and jumped on it."
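If the tags themselves and the text outside them both need to survive, a base R variant may be easier to reason about. The following is only a sketch of a different technique (regmatches with gregexpr) rather than the gsubfn approach above; the helper name clean_tagged and its replacement argument are hypothetical, and the tag name TEXT is assumed to be fixed.

```r
# Hypothetical helper (not from the answers above): replace line breaks
# only inside <TEXT>...</TEXT> blocks, keeping tags and outside text intact.
clean_tagged <- function(x, replacement = " ") {
  # (?s) lets "." match line breaks in PCRE; ".*?" keeps each match to one block
  m <- gregexpr("(?s)<TEXT>.*?</TEXT>", x, perl = TRUE)
  blocks <- regmatches(x, m)
  # \R+ matches a run of line breaks (LF, CR or CRLF); it needs perl = TRUE
  regmatches(x, m) <- lapply(blocks, function(b) {
    gsub("\\R+", replacement, b, perl = TRUE)
  })
  x
}

string <- "Don't delete this newline\n\n<TEXT>Purchased this as a cert pre owned for a great price.\nHad only 10000 miles on it and jumped on it.</TEXT>"
result <- clean_tagged(string)
print(result)
```

Here the two line breaks after the first sentence are outside the tags and stay in place, while the one inside the <TEXT> block is replaced by a space.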

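Answer 1 (score: 1)

It seems you want to remove only a single line break, or a single run of consecutive line breaks, between two tags, with optional whitespace around the line break.

Use a PCRE regex replacement: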

(<TEXT>.*?)\h*\R+\h*(.*?</TEXT>)

See the regex demo and the R demo.

Details

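  • (<TEXT>.*?) - Group 1: <TEXT>, then any 0+ characters other than line break characters (since . in a PCRE regex does not match line breaks), as few as possible, up to the first occurrence of the subsequent subpatterns
  • \h* - 0+ horizontal whitespace characters (greedy match)
  • \R+ - any sequence of 1 or more line breaks (such as CR, LF or CRLF)
  • \h* - 0+ horizontal whitespace characters (greedy match)
  • (.*?</TEXT>) - Group 2: any 0+ characters other than line break characters, followed by the </TEXT> string.

In the replacement, \1 inserts back the value of Group 1 and \2 does the same for the Group 2 value.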
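Put together as a single R call, the pieces listed above might look like the sketch below. The variable names x, pattern and cleaned are assumptions made for illustration; the sample string is the one from the question.

```r
# Sample input from the question: a space and one line break inside <TEXT>
x <- "<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it.</TEXT>"

# perl = TRUE is required for \h and \R; backslashes are doubled in R strings
pattern <- "(<TEXT>.*?)\\h*\\R+\\h*(.*?</TEXT>)"

# \\1\\2 keeps both groups and drops the whitespace and line break between them
cleaned <- gsub(pattern, "\\1\\2", x, perl = TRUE)
print(cleaned)
# [1] "<TEXT>Purchased this as a cert pre owned for a great price.Had only 10000 miles on it and jumped on it.</TEXT>"
```

Because . does not match line breaks here, this form only handles one run of line breaks per <TEXT> block.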

If you want to replace multiple occurrences between the two strings, use Hack-R's gsubfn approach:

> library(gsubfn)
> x2 <- "<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it. \r\nAnd another sentence.</TEXT>"
> gsubfn("(<TEXT>)(.*?)(</TEXT>)", function(g1,g2,g3) paste0(g1,gsub("\\h*\\R+\\h*", "", g2, perl=TRUE),g3), x2)
[1] "<TEXT>Purchased this as a cert pre owned for a great price.Had only 10000 miles on it and jumped on it.And another sentence.</TEXT>"

The (<TEXT>)(.*?)(</TEXT>) pattern matches <TEXT> and captures it into Group 1, then captures any 0+ characters, as few as possible, into Group 2 (lazy match), then captures </TEXT> into Group 3. Then, in the callback inside gsubfn, you can remove all occurrences of <spaces>*<line_break(s)><spaces>* with gsub("\\h*\\R+\\h*", "", g2, perl=TRUE).