Removing several substrings between two specific characters

Time: 2017-02-08 16:05:33

Tags: r regex

I need help with a regular expression in R. I have a bunch of strings, and each of them has a structure similar to this one:

mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes.  Thank you very much for completing this\": ME.' 'You!' sai"

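Note that this string contains substrings enclosed in double quotes. Each one is followed by ":" and some unquoted text, until we hit "|", after which a new quoted substring begins, and so on.

Also note that at the very end there is text after the ":", but no "|" at the end.

My goal is to completely remove all text that starts with ":" (including the ":" itself) up to the next "|" (but the "|" must stay). I also need to remove all the text after the last ":".

Finally (this is more of a bonus), I would like to get rid of all "\" characters and all quotation marks, because in the final solution I need "clean text": a series of strings separated only by "|" characters.

Is this possible?

Here is my awkward first attempt: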

gsub('\\:.*?\\|', '', mytext)

2 answers:

Answer 0: (score: 2)

This approach uses three passes of g?sub (two gsub calls and one sub):

sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes.  Thank you very much for completing this"

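The first pass removes the text from ":" through "|", inclusive, and replaces it with "|". The second pass removes the "\" and double-quote characters, and the third pass removes the trailing "|" at the end of the string.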
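To make the three passes easier to follow, here is a minimal sketch that runs them one at a time. The string sample_text and the names step1, step2 and step3 are made up for illustration and do not appear in the original answer; the patterns are the ones used above.

```r
# Hypothetical short sample with the same shape as mytext (an assumption, not from the question)
sample_text <- "\"first part\": junk one|\"second part\": junk two"

# Pass 1: replace ":" plus everything up to the next "|" (or the end of the string) with "|"
step1 <- gsub(":.*?(\\||$)", "|", sample_text)

# Pass 2: drop backslashes and double quotes
step2 <- gsub("[\\\\\"]", "", step1)

# Pass 3: drop the single trailing "|" that pass 1 leaves at the end
step3 <- sub("\\|$", "", step2)

print(step3)
```

Printing step1 and step2 as well shows what each intermediate string looks like, which can help when adapting the patterns to differently shaped input.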

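Answer 1: (score: 1)

With a single gsub, you can match the text after the ":" (including the ":" itself) as long as it does not contain a pipe: :[^|]* . This also covers the case at the end of the string. You can match the double quotes as well by adding another pattern after the alternation character (|): [\"]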

gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes.  Thank you very much for completing this"
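As a rough illustration of how the two alternatives in that pattern contribute, the sketch below applies each half separately and then together. The string sample_text and the variables no_tail, no_quotes, cleaned and pieces are hypothetical names chosen for this example. The final strsplit step is an optional extra, assuming the cleaned text should end up as a character vector.

```r
# Hypothetical short sample with the same shape as mytext (an assumption, not from the question)
sample_text <- "\"first part\": junk one|\"second part\": junk two"

# First alternative only: ":" followed by any run of non-pipe characters
no_tail <- gsub(":[^|]*", "", sample_text)

# Second alternative only: a literal double quote
no_quotes <- gsub("[\"]", "", sample_text)

# Both alternatives combined, as in the answer above
cleaned <- gsub(":[^|]*|[\"]", "", sample_text)

# Optional extra step: split on the literal "|" to get one element per piece
pieces <- strsplit(cleaned, "|", fixed = TRUE)[[1]]

print(cleaned)
print(pieces)
```

Note that the \" sequences in the R source are escapes for the double quote, so the string itself contains no backslash characters. That would explain why this pattern does not need to match "\" explicitly.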