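I am working in R and trying to clean up a file by removing line breaks that sit in inconvenient places, namely every line break that falls between a pair of tags: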
<sometext> ... \n .. </sometext>
For example,
<TEXT>Purchased this as a cert pre owned for a great price. \n
Had only 10000 miles on it and jumped on it.</TEXT>
should become
<TEXT>Purchased this as a cert pre owned for a great price.Had only 10000 miles on it and jumped on it.</TEXT>
I was thinking of using an expression such as
(<[A-z]+>)(.+)(\n)(.+)(<\/[A-z]+>)
and then deleting whatever matches in group 3, but there must be something "smarter" to do.
Answer 0 (score: 4)
I think you may be overcomplicating this, unless I am misunderstanding something:
string <- "<TEXT>Purchased this as a cert pre owned for a great price.
Had only 10000 miles on it and jumped on it.</TEXT>"
string
[1] "<TEXT>Purchased this as a cert pre owned for a great price.\nHad only 10000 miles on it and jumped on it.</TEXT>"
gsub("\n"," ", string)
[1] "<TEXT>Purchased this as a cert pre owned for a great price. Had only 10000 miles on it and jumped on it.</TEXT>"
Update: based on your comment, you only want to do this between a pair of tags. That is easy to do with the gsubfn package:
string <- "Don't delete this newline
<TEXT>Purchased this as a cert pre owned for a great price.
Had only 10000 miles on it and jumped on it.</TEXT>"
string
gsub("\n"," ", string)
library(gsubfn)
gsub("\n", " ", strapplyc(string, ">(.*?)</", simplify = c))
which gives:
[1] "Purchased this as a cert pre owned for a great price. Had only 10000 miles on it and jumped on it."
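Text that is not between tags is not matched, so it is not affected by gsub.
Depending on your needs, you may also want something like this, which puts the cleaned text back into the original string (the tags themselves are dropped):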
sub("<(.*?)>(.*?)</(.*?)>",gsub("\n", " ", strapplyc(string, ">(.*?)</", simplify = c)),string)
[1] "Don't delete this newline\n Purchased this as a cert pre owned for a great price. Had only 10000 miles on it and jumped on it."
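If the tags need to stay in place and the gsubfn dependency is not wanted, a base R variant is possible. The following is only a minimal sketch and not part of the answer above: the helper name clean_tagged and its tag argument are hypothetical, and it relies on gregexpr and the replacement form of regmatches to rewrite each matched tag pair.

```r
# Hypothetical helper (illustrative name): replace line feeds with a space,
# but only inside <tag>...</tag> pairs; text outside the tags is left as is.
clean_tagged <- function(x, tag = "TEXT") {
  # (?s) lets "." match line breaks under PCRE; ".*?" keeps each match lazy
  pattern <- sprintf("(?s)<%s>.*?</%s>", tag, tag)
  m <- gregexpr(pattern, x, perl = TRUE)
  # Rewrite only the matched spans and write them back into x
  regmatches(x, m) <- lapply(regmatches(x, m), function(hit) {
    gsub("\n", " ", hit, fixed = TRUE)
  })
  x
}

string <- "Don't delete this newline\n<TEXT>Purchased this as a cert pre owned for a great price.\nHad only 10000 miles on it and jumped on it.</TEXT>"
cleaned <- clean_tagged(string)
print(cleaned)
```

The line break after "Don't delete this newline" survives because it lies outside the matched span, while the one inside the TEXT element becomes a space.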
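Answer 1 (score: 1)
It seems you only want to remove a single line break, or a single run of consecutive line breaks, between two tags, together with any optional whitespace around it.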
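Use a PCRE regex replacement with the pattern
(<TEXT>.*?)\h*\R+\h*(.*?</TEXT>)
and the replacement \1\2.
See the regex demo and R demo.
Details
(<TEXT>.*?) - Group 1: <TEXT>, then any 0+ characters other than line break characters (since . in a PCRE regex does not match line breaks), as few as possible, up to the first occurrence of the subsequent subpatterns
\h* - 0+ horizontal whitespace characters (greedy match)
\R+ - any sequence of 1 or more line breaks (CR, LF, or CRLF)
\h* - 0+ horizontal whitespace characters (greedy match)
(.*?</TEXT>) - Group 2: any 0+ characters other than line break characters, followed by the </TEXT> string.
In the replacement, \1 inserts the value of Group 1 back and \2 does the same for the Group 2 value.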
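Assembled in R, the replacement described above might look like the sketch below. The gsub call with perl = TRUE is an assumption about how the pattern is meant to be applied, and the variable names pattern, x, x2, res and res2 are only illustrative.

```r
# Pattern assembled from the parts described above (backslashes doubled for R strings)
pattern <- "(<TEXT>.*?)\\h*\\R+\\h*(.*?</TEXT>)"

# One run of line breaks between the tags: the run and the spaces around it are removed
x <- "<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it.</TEXT>"
res <- gsub(pattern, "\\1\\2", x, perl = TRUE)
print(res)

# Two separate runs of line breaks: "." cannot cross a line break under PCRE,
# so neither group can span the second run and the pattern does not match at all
x2 <- "<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it. \r\nAnd another sentence.</TEXT>"
res2 <- gsub(pattern, "\\1\\2", x2, perl = TRUE)
print(identical(res2, x2))
```

The second call leaves x2 untouched, which is why this single-pass pattern only suits input with one run of line breaks per tag pair.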
If you want to replace multiple occurrences between the two strings, use the gsubfn approach from Hack-R:
> library(gsubfn)
> x2 <- "<TEXT>Purchased this as a cert pre owned for a great price. \nHad only 10000 miles on it and jumped on it. \r\nAnd another sentence.</TEXT>"
> gsubfn("(<TEXT>)(.*?)(</TEXT>)", function(g1,g2,g3) paste0(g1,gsub("\\h*\\R+\\h*", "", g2, perl=TRUE),g3), x2)
[1] "<TEXT>Purchased this as a cert pre owned for a great price.Had only 10000 miles on it and jumped on it.And another sentence.</TEXT>"
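The (<TEXT>)(.*?)(</TEXT>) pattern matches <TEXT> and captures it into Group 1, then captures any 0+ characters, as few as possible, into Group 2 (lazy match), and then captures </TEXT> into Group 3. Then, in the callback inside gsubfn, you can remove every occurrence of <spaces>*<line_break(s)><spaces>* using gsub("\\h*\\R+\\h*", "", g2, perl=TRUE).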