How to extract hashtags using gsub

Asked: 2019-03-16 08:21:20

Tags: r regex string

I want to extract just the hashtags from tweets using gsub. For example:

sentence = tweet_text$text

which gives "The #Sun #Halo is out in full force today People need to look up once in awhile to see", "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange", "Multiple warnings in effect for snow and wind with the latest #storm Metro"

What I want is just #Sun and #Halo from the first one, #YouthStrikeClimate, #Friday~~ from the second one, and #storm from the last one.

I tried to do this with the following:

sentence = gsub("^(?!#)","",sentence,perl = TRUE)
# or
sentence1 = gsub("[^#\\w+]","",sentence,perl = TRUE)

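or something along those lines. I have already removed the unneeded tokens, such as numbers and http:// links.

How can I extract them using gsub?

2 Answers:

Answer 0: (score: 2)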

We can use str_extract_all from stringr and extract every run of word characters that follows a hash (#).

stringr::str_extract_all(x, '#\\w+')

#[[1]]
#[1] "#Sun"  "#Halo"

#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture" "#FridaysFuture"  "#ClimateChange"

#[[3]]
#[1] "#storm"

A base R approach with minimal regex: split each string on whitespace and keep only the words for which startsWith(p, "#") is TRUE.

sapply(strsplit(x, "\\s+"), function(p) p[startsWith(p, "#")])
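One caveat with the line above: sapply simplifies its result when every string happens to contain the same number of hashtags, so the return type can change with the input. A minimal sketch using lapply, which always returns a list; the helper name extract_hashtags and the sample tweets are made up for illustration:

```r
# Hypothetical helper (not from the answer): always returns a list,
# one character vector of hashtags per input string.
extract_hashtags <- function(txt) {
  words <- strsplit(txt, "\\s+")                    # split on runs of whitespace
  lapply(words, function(p) p[startsWith(p, "#")])  # keep words beginning with '#'
}

tweets <- c("The #Sun #Halo is out", "latest #storm  Metro", "no tags here")
res <- extract_hashtags(tweets)

res[[1]]      # hashtags found in the first string
res[[2]]      # hashtags found in the second string
lengths(res)  # number of hashtags per string; 0 where none were found
```

A string without hashtags yields character(0) rather than being dropped, so the result stays aligned with the input.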

Data

x <- c("The #Sun #Halo is out in full force today People need to look up once in", 
  "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange", 
  "Multiple warnings in effect for snow and wind with the latest #storm  Metro")

Answer 1: (score: 1)

In base R, we can use regmatches/gregexpr

regmatches(x, gregexpr("#\\S+", x))
#[[1]]
#[1] "#Sun"  "#Halo"

#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture"    "#FridaysFuture"      "#ClimateChange"     

#[[3]]
#[1] "#storm"
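The pattern "#\\S+" matches everything up to the next whitespace, while "#\\w+" (used in the other answer) stops at the first non-word character. On the sample data the two agree, but they can differ once punctuation is attached to a hashtag. A small sketch on a made-up string (y is not part of the original data):

```r
# Made-up example string with punctuation attached to the hashtags.
y <- "Stay safe in the #storm, everyone! #StaySafe."

# '#' followed by any non-whitespace: trailing punctuation is kept.
non_space <- regmatches(y, gregexpr("#\\S+", y))[[1]]

# '#' followed by word characters only: punctuation is excluded.
word_only <- regmatches(y, gregexpr("#\\w+", y))[[1]]

non_space
word_only
```

Which pattern is preferable depends on whether the text has already been stripped of punctuation, as the question suggests it has.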

As for using gsub, either of the following works:

trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))

trimws(gsub("(^| )[A-Za-z]+\\b", "", x))

Both calls delete every word that does not start with #, leaving the hashtags of each string separated by whitespace.

Data
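Since gsub deletes the unwanted words rather than extracting matches, each element of the result is a single string. If one hashtag per element is wanted, a further split on whitespace is one option. A minimal sketch on shortened, made-up input (tw and kept are illustrative names):

```r
# Shortened sample input, including a run of several spaces.
tw <- c("The #Sun #Halo is out in full force today",
        "latest #storm       Metro")

# Remove every whitespace-delimited word that is not preceded by '#',
# then trim the leftover leading/trailing whitespace.
kept <- trimws(gsub("(?<!#)\\b\\S+\\s*", "", tw, perl = TRUE))
kept

# Split each remaining string into its individual hashtags.
tags <- strsplit(kept, "\\s+")
tags
```

The split is on "\\s+" rather than a single space because the removal step can leave more than one space between surviving hashtags.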

x <- c("The #Sun #Halo is out in full force today People need to look up once in", 
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange", 
 "Multiple warnings in effect for snow and wind with the latest #storm       Metro"
 )