Question

I need help with a regular expression in R. I have a bunch of strings, each with a structure similar to this one:

mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes.  Thank you very much for completing this\": ME.' 'You!' sai"

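Note that this string contains substrings enclosed in "", each followed by a ":" and some unquoted text, until we reach a "|"; then a new quoted substring begins, and so on.

Note also that at the very end there is text after a ":", but no "|" at the very end.

My goal is to completely remove all text starting with ":" (and including the ":") up to the next "|" (but the "|" has to stay). I also need to remove all text that follows the last ":".

Finally (this is more of a bonus), I want to get rid of all "\" characters and all quotation marks, because in the final result I need "clean text": a set of strings separated only by "|" characters.

Is this possible?

Here is my awkward first attempt: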

gsub('\\:.*?\\|', '', mytext)

Answer 1

This approach uses three g?sub passes (two gsub calls and one sub call).

sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes.  Thank you very much for completing this"

The first pass removes the text between ":" and "|", inclusive, and replaces it with "|". The second pass removes the "\" and """ characters, and the third pass removes the "|" left at the end.
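To make the three passes easier to follow, here is a minimal sketch that runs them one at a time. The input `x` is a hypothetical shortened string with the same shape as mytext, and the names `step1` to `step3` are only for illustration.

```r
# Hypothetical shortened input with the same shape as mytext
x <- "\"first quote\": filler one|\"second quote\": filler two|\"third quote\": trailing"

# Pass 1: replace everything from ":" through the next "|"
# (or through the end of the string) with a single "|"
step1 <- gsub(":.*?(\\||$)", "|", x)

# Pass 2: drop double quotes and any literal backslashes
step2 <- gsub("[\\\\\"]", "", step1)

# Pass 3: drop the trailing "|" that pass 1 left at the end
step3 <- sub("\\|$", "", step2)

print(step3)
```

Pass 1 leaves a "|" at the end because the final segment has no pipe of its own and is matched by the `$` alternative, which is why pass 3 is needed.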

Answer 2

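With a single gsub, you can match the text after the : (including the :) for as long as it contains no pipe: :[^|]* . This also covers the case at the end of the string. You can match the double quotes in the same call by adding a second pattern after the alternation operator (|): [\"]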

gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes.  Thank you very much for completing this"
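The sketch below applies each alternative of the pattern separately and then both together, to show what each one contributes. The input `x` and the variable names are hypothetical, chosen only to keep the example short.

```r
# Hypothetical shortened input with the same shape as mytext
x <- "\"first quote\": filler one|\"second quote\": trailing"

# First alternative only: ":" plus everything up to (not including) the next "|"
no_filler <- gsub(":[^|]*", "", x)

# Second alternative only: the double quotes
no_quotes <- gsub("[\"]", "", x)

# Both alternatives in one call
clean <- gsub(":[^|]*|[\"]", "", x)
print(clean)

# The \" sequences in the source are escapes for a double quote;
# the string itself holds no literal backslash
has_backslash <- grepl("\\", x, fixed = TRUE)
print(has_backslash)
```

Because `[^|]*` stops before the pipe, the separators are never consumed and no trailing "|" is produced. Note that a colon inside a quoted part would also start a match, so this relies on the quoted text containing no ":" characters.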

Remove several strings between two specific characters
