I need help with regular expressions in R. I have a bunch of strings, each structured like this one:
mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes. Thank you very much for completing this\": ME.' 'You!' sai"
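Note that this string contains substrings inside double quotes, each followed by ":" and some unquoted text, until we reach "|". Then a new quoted substring begins, and so on.
Also note that at the very end there is text after ":", but no closing "|".
My goal is to remove all text starting at ":" (including the ":" itself) up to the next "|" (the "|" must stay). I also need to remove all text after the last ":".
Finally (this is more of a bonus), I would like to get rid of all "\" characters and all quotes, because the final result needs to be "clean text": a set of strings separated only by "|" characters.
Is this possible?
Here is my awkward first attempt: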
gsub('\\:.*?\\|', '', mytext)
Answer 0 (score: 2)
This approach uses three passes of g?sub (that is, two calls to gsub and one to sub).
sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
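The first pass removes the text between ":" and "|" inclusive (or between ":" and the end of the string) and replaces it with "|". The second pass removes "\" and double-quote characters, and the third pass removes the trailing "|".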
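To make the three passes easier to follow, here is a minimal sketch that runs them one at a time. The short input x and the names step1, step2 and step3 are hypothetical, chosen only for illustration; the patterns are the ones from the answer above.

```r
# Hypothetical short input with the same structure as mytext.
# The backslashes in the literal are R escapes for the quotes;
# the string itself contains no backslash characters.
x <- "\"first quote\": junk one|\"second quote\": junk two"

# Pass 1: replace ":" plus everything up to the next "|" (or the end
# of the string) with a single "|".
step1 <- gsub(":.*?(\\||$)", "|", x)
# step1 is: "first quote"|"second quote"|   (quotes still present)

# Pass 2: drop any backslash and double-quote characters.
step2 <- gsub("[\\\\\"]", "", step1)
# step2 is: first quote|second quote|

# Pass 3: drop the single trailing "|" left behind by pass 1.
step3 <- sub("\\|$", "", step2)
print(step3)
# [1] "first quote|second quote"
```

The trailing "|" appears because pass 1 also matches the last ":" segment, which ends at the end of the string rather than at a pipe; pass 3 exists only to clean that up.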
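Answer 1 (score: 1)
With a single gsub, you can match the text after : (including the : itself) as long as it does not contain a pipe: :[^|]*. This also handles the case at the end of the string. You can also match the double quotes by adding a second pattern after the alternation character (|): [\"]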
gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
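As a quick check of the single-pattern approach, the sketch below applies it to a short hypothetical input and then splits the result on "|". The names x, cleaned and parts are illustrative only, and the strsplit step is an addition for inspecting the pieces, not part of the answer.

```r
# Hypothetical short input with the same structure as mytext.
x <- "\"first quote\": junk one|\"second quote\": junk two"

# One pass: remove either ":" followed by any run of non-pipe
# characters, or a double quote. The pipes are never matched,
# so they stay in place and no trailing "|" is produced.
cleaned <- gsub(":[^|]*|[\"]", "", x)
print(cleaned)
# [1] "first quote|second quote"

# Split on the literal pipe to inspect the individual pieces.
parts <- strsplit(cleaned, "|", fixed = TRUE)[[1]]
print(parts)
# [1] "first quote"  "second quote"
```

One caveat that seems worth keeping in mind: the pattern starts removing text at any colon, so a colon inside a quoted substring would cut that substring short. The sample string has no such colons, so this does not affect the output shown above.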