How do I use R to remove all characters between two other recurring characters in a string?

Time: 2018-12-14 00:45:24

Tags: r regex string text gsub

The following code successfully retrieves the text I need, before any "clean-up" with gsub.

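library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse(), xpathApply(), xmlValue()

am1 <- getURL("url.com")
ami1 <- htmlTreeParse(am1, useInternalNodes = TRUE)
# Extract the text of every <td> cell
ami1.tree.parse <- unlist(xpathApply(ami1, path = '//td', fun = xmlValue))
# Join all cells except the first and last into one space-separated string
ami1.txt <- NULL
for (i in 2:(length(ami1.tree.parse) - 1)) {
  ami1.txt <- paste(ami1.txt, as.character(ami1.tree.parse[i]), sep = ' ')
}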

The problem

I cannot remove all of the questions from the interview text. For example, the text looks like this:

[1] "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."

To be completely clear, I need the text above to become:

[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."

I have tried:

 ami1.txt<-gsub("Q.[^?]+H:", "",ami1.txt)
 ami1.txt<-gsub("Q.[^?]+H: ", "",ami1.txt)
 ami1.txt<-gsub("Q.*H:", "",ami1.txt)

This comes down to my not having a good grasp of regex, but I would appreciate it if someone could point me in the right direction.
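For reference, here is a minimal sketch of what the three attempts do. It uses a shortened stand-in string (the name `x` and its contents are made up for illustration, not the real interview text). The negated class `[^?]+` cannot step over the question mark that ends each question, and `.*` is greedy.

```r
# Hypothetical shortened sample in the same "Q. ...?NAME: answer" shape.
x <- "Q. How are things?JOE SMITH: Going fine.Q. Okay. And in five years?JOE SMITH: In Los Angeles."

# Attempts 1 and 2: [^?]+ stops at the "?" that ends each question,
# so "H:" is never reached and nothing is replaced.
attempt1 <- gsub("Q.[^?]+H:", "", x)
identical(attempt1, x)
# [1] TRUE

# Attempt 3: .* is greedy, so the match runs from the first "Q"
# to the last "H:" and swallows the earlier answers as well.
attempt3 <- gsub("Q.*H:", "", x)
attempt3
# [1] " In Los Angeles."
```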

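Unfortunately I lied; the text is apparently a bit more complicated. I have added the more complex elements to the end of the text below. Some of the "questions" (Q.) begin with a statement: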

 str2<-"Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively.Q. That's interesting. When would you consider speaking to her?JOE SMITH: Probably, tomorrow. Q. That sounds good. How do you feel now? Better than before?JOE SMITH: Yeah I'm feeling alright."

The task remains the same, and akrun's answer gets me close:

 trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))
 [1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively. Probably, tomorrow.  Better than before? Yeah I'm feeling alright."

Final update

akrun's answer:

 trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str2))

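I am not entirely sure why the answer above did not remove everything between "Q" and the last question mark. After revising the question above, I realized that what I actually need is for everything from "Q" through ":" to be removed, so I used a tool to help me understand where my grasp of regex went wrong. The following erases all characters between "Q" and ":".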

 gsub("Q[^:]+\\?|[A-Z ]+:", "", str2)
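The leftover text is consistent with how `[^?]+\\?` behaves: it stops at the first question mark after "Q", so a question containing two question marks is only partly removed. The sketch below compares the two patterns on a shortened stand-in for the last exchange (the names `x2`, `first_try`, and `revised` are made up for illustration).

```r
# Hypothetical sample: one "question" that contains two question marks.
x2 <- "Q. Sounds good. How do you feel now? Better than before?JOE SMITH: Yeah I'm feeling alright."

# Original pattern: [^?]+ stops at the first "?", so the second
# question is left behind.
first_try <- trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", x2))
first_try
# [1] "Better than before? Yeah I'm feeling alright."

# Revised pattern: [^:]+ may cross "?" but not ":", so the match
# extends to the last "?" before the speaker label.
revised <- trimws(gsub("Q[^:]+\\?|[A-Z ]+:", "", x2))
revised
# [1] "Yeah I'm feeling alright."
```

Both patterns appear to rely on the answers containing no capital "Q" and no ":"; text that breaks either assumption would likely need a stricter pattern.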

1 answer:

Answer 0 (score: 0)

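We can match a Q followed by one or more characters that are not a ? ([^?]+) and then a question mark, or (|) one or more uppercase letters or spaces followed by a :, and replace the match with an empty string (""). If there are leading/trailing spaces, use trimws.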

trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", str1))
#[1] "It's going quite alright. I'll probably move to Los Angeles and get into acting. I think she'd respond positively."
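To see what each half of the alternation contributes, the two branches can be applied one at a time. This is only an illustrative sketch on a shortened string; the names `s`, `no_questions`, `no_labels`, and `combined` are hypothetical.

```r
# Hypothetical shortened sample in the same shape as str1.
s <- "Q. How are things?JOE SMITH: Going fine.Q. Okay. And later?JOE SMITH: Acting."

# Branch 1: "Q", then non-"?" characters, then the closing "?".
no_questions <- gsub("Q[^?]+\\?", "", s)
no_questions
# [1] "JOE SMITH: Going fine.JOE SMITH: Acting."

# Branch 2: a run of uppercase letters/spaces ending in ":" (the speaker label).
no_labels <- gsub("[A-Z ]+:", "", no_questions)
no_labels
# [1] " Going fine. Acting."

# The single alternation gives the same result in one pass.
combined <- trimws(gsub("Q[^?]+\\?|[A-Z ]+:", "", s))
combined
# [1] "Going fine. Acting."
```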

Data

str1 <- "Q. How well do you think things are going in your marriage?JOE SMITH: It's going quite alright.Q. Where do you see yourself in five years?JOE SMITH: I'll probably move to Los Angeles and get into acting.Q. Okay. How do you think your wife feels about your thinking?JOE SMITH: I think she'd respond positively."