Remove hashtags from the beginning and end of tweets in R

Time: 2018-08-21 11:03:23

Tags: r regex tweets

I am trying to remove hashtags from the end of a string in R. For example:

 x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

I want to remove the #movie and #lateNightThoughts hashtags at the end of the string while keeping #boring. Expected result:

 - "I didn't know it could be #boring. guess I need some fun"

I tried:

stringi::stri_replace_last_regex(x,'#\\S+',"")

But it removes only the last hashtag:

- "I didn't know it could be #boring. guess I need some fun #movie "

Do you know how to get the expected result?

Edit:

How can I also remove hashtags from the beginning of the text? For example:

x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

1 answer:

Answer 0: (score: 2)

You can use

>  x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"

Or, if you do not care about the context of the first # where the match starts, you can even use

sub("(?:\\s*#\\w+)+\\s*$", "", x)

See the regex demo.

Details

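To make the difference between the two patterns concrete, here is a small sketch. The input string `y` and the variable names are made up for illustration, and `perl = TRUE` is my choice so that `\B` follows PCRE semantics; the original answer calls `sub()` without it.

```r
# The two patterns from the answer, stored under illustrative names
with_B    <- "\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$"
without_B <- "(?:\\s*#\\w+)+\\s*$"

# Hypothetical input: "a#b" contains a # glued to a word character
y <- "see a#b #tag"

# \B refuses to start the match at the # inside "a#b"
res_with <- sub(with_B, "", y, perl = TRUE)
res_with
# [1] "see a#b"

# Without \B the match starts at that inner # and runs to the end
res_without <- sub(without_B, "", y, perl = TRUE)
res_without
# [1] "see a"
```

So the shorter pattern is fine when a # never appears inside a word, and the `\B` version is the safer default otherwise.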

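  • \s* - zero or more whitespace characters
  • \B - a non-word boundary: because the next character, #, is a non-word character, the current position must be preceded by the start of the string or by another non-word character. This keeps the pattern from matching a # inside a "word" such as a#b; remove \B if you do not need that check
  • # - a # character
  • \w+ - one or more word characters (letters, digits or _)
  • (?:\s*#\w+)* - zero or more occurrences of:
    • \s* - zero or more whitespace characters
    • # - a # character
    • \w+ - one or more word characters
  • \s* - zero or more whitespace characters
  • $ - the end of the string.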
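
The answer above does not cover the edit in the question (hashtags at the beginning of the text). A minimal sketch of one way to handle both ends, assuming the same `#\w+` notion of a hashtag: anchor the run of hashtags with `^` for the start, then reuse the trailing pattern. The helper name `strip_edge_hashtags` is hypothetical, and `perl = TRUE` is my assumption rather than something the answer uses.

```r
# Hypothetical helper (not part of the original answer): removes runs of
# hashtags at the start and at the end of each string, keeping inner ones.
strip_edge_hashtags <- function(x) {
  # Leading run: one or more "#word" tokens at the start, plus trailing whitespace
  x <- sub("^(?:\\s*#\\w+)+\\s*", "", x, perl = TRUE)
  # Trailing run: the shorter pattern from the answer, anchored at the end
  sub("(?:\\s*#\\w+)+\\s*$", "", x, perl = TRUE)
}

x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
strip_edge_hashtags(x)
# [1] "I didn't know it could be #boring. guess I need some fun"
```

Both calls use `sub()`, so each removes at most one run; #boring survives because it is followed by text rather than by the end of the string.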