I want to remove the text before the first <br/>

Time: 2019-03-30 15:03:03

Tags: r regex

I want to keep the text after the first < br/ > tag and then remove the < br/ > tags from the rest of the text.

x=data.frame(text=c("Hi John, hope you are doing well.< br/ >Let me know, when we can meet? < br/ > I have lot to talk about"))

Expected output:

"Let me know, when we can meet? I have lot to talk about"

4 answers:

Answer 0: (score: 4)

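Note that regular expressions are generally not well suited to parsing HTML content. Since your content is not nested, a regex may be reliable here, and we can try doing this with two calls to sub:

text <- "Hi John, hope you are doing well.< br/ >Let me know, when we can meet? < br/ > I have lot to talk about"
sub("< br/ >\\s*", "", sub(".*?< br/ >(.*)", "\\1", text))

[1] "Let me know, when we can meet? I have lot to talk about"

The inner call to sub first removes the leading part of the text, up to and including the first < br/ > tag. The outer call to sub then strips the remaining < br/ > tag, along with any whitespace that follows it. Because sub replaces only the first match, this covers exactly one remaining tag, as in the sample text.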
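If more than one < br/ > tag can remain after the first one, the outer sub would leave the extras in place. A minimal sketch of a variant that switches the outer call to gsub follows; the helper name strip_before_first_br is made up for illustration and is not part of the original answer, and perl = TRUE is added only to make the lazy quantifier's behaviour explicit.

```r
# Hypothetical helper (not from the original answer): drop everything up to
# and including the first "< br/ >", then remove every remaining tag.
strip_before_first_br <- function(text) {
  # The lazy .*? stops at the first "< br/ >", so only the leading part goes.
  rest <- sub(".*?< br/ >", "", text, perl = TRUE)
  # gsub() removes all remaining tags plus any whitespace right after them.
  gsub("< br/ >\\s*", "", rest, perl = TRUE)
}

text <- "Hi John, hope you are doing well.< br/ >Let me know, when we can meet? < br/ > I have lot to talk about"
result <- strip_before_first_br(text)
print(result)
```

Both functions are vectorised, so the helper can also be applied directly to a column such as x$text.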

Answer 1: (score: 2)

A non-regex answer is to split on "< br/ >", collect all the pieces except the first, and paste them together.

sapply(strsplit(as.character(x$text), "< br/ >"),
          function(x) paste0(x[-1], collapse = ""))
#[1] "Let me know, when we can meet?  I have lot to talk about"
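The result above keeps both spaces that surrounded the second tag, so it differs from the expected output by one space. One possible way to close that gap, sketched under the assumption that the pieces should be joined by exactly one space, is to trim each piece before pasting. The names pieces and joined are illustrative only, and fixed = TRUE is added so that the split pattern is treated as a literal string.

```r
x <- data.frame(
  text = "Hi John, hope you are doing well.< br/ >Let me know, when we can meet? < br/ > I have lot to talk about",
  stringsAsFactors = FALSE
)

# Split on the literal tag; fixed = TRUE disables regex interpretation.
pieces <- strsplit(x$text, "< br/ >", fixed = TRUE)

# Drop the first piece, trim surrounding whitespace from the rest,
# and join them with a single space (an assumption about the desired output).
joined <- sapply(pieces, function(p) paste(trimws(p[-1]), collapse = " "))
print(joined)
```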

Answer 2: (score: 1)

Another, less efficient approach using gsub:

res1<-gsub("< br/ >|\\s{1,}(?<=\\n)","",gsub(".*(?=Let)","",x$text,perl=TRUE),perl=TRUE)
gsub("  ","",res1,perl=TRUE)

This also removes the space before "I":

[1] "Let me know,when we can meet?I have lot to talk about"

Answer 3: (score: 1)

We can use str_extract_all to extract every run of characters other than < that comes right after the pattern (< br/ >)

library(stringr)
paste(str_extract_all(x$text, "(?<=< br/ >)[^<]+")[[1]], collapse="")
#[1] "Let me know, when we can meet?  I have lot to talk about"

Another option is to replace < br/ > with a delimiter, read the result with read.csv/read.table, and paste the fields after the first back together

do.call(paste0, read.csv(text = gsub("< br/ >", ";", x$text, 
  fixed = TRUE), header = FALSE, sep=";", stringsAsFactors = FALSE)[-1])
#[1] "Let me know, when we can meet?  I have lot to talk about"