Question

I just want to extract the hashtags from tweets using gsub. For example:

sentence = tweet_text$text

The result is:

"The #Sun #Halo is out in full force today People need to look up once in awhile to see",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro"

All I want is #Sun and #Halo from the first one, #YouthStrikeClimate, #Friday~~ from the second one, and #storm from the last one.

I tried to do this with:

sentence = gsub("^(?!#)","",sentence,perl = TRUE)

or

sentence1 = gsub("[^#\\w+]","",sentence,perl = TRUE)

or something similar. I have already removed the useless words, such as numbers and http:// links.

How can I extract them using gsub?

Answer 1

We can use str_extract_all from stringr to extract every word that is preceded by a hash (#).

stringr::str_extract_all(x, '#\\w+')

#[[1]]
#[1] "#Sun"  "#Halo"

#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture" "#FridaysFuture"  "#ClimateChange"

#[[3]]
#[1] "#storm"

A base R approach with minimal regex: split each string on whitespace and keep only the words that start with # (using startsWith).

sapply(strsplit(x, "\\s+"), function(p) p[startsWith(p, "#")])
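One thing to be aware of with the call above: sapply simplifies its result when every tweet yields the same number of hashtags, so the return type can change with the input. A minimal sketch, assuming you always want a list back, is to swap in lapply (the name tags is just for illustration):

```r
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
       "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
       "Multiple warnings in effect for snow and wind with the latest #storm  Metro")

# Split on runs of whitespace, then keep the tokens that begin with "#".
# lapply never simplifies, so the result is a list with one element per tweet.
tags <- lapply(strsplit(x, "\\s+"), function(p) p[startsWith(p, "#")])

tags[[1]]
tags[[3]]
```

Note that this keeps whole whitespace-delimited tokens, so any punctuation attached to a hashtag would be kept as well.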

Data

x <- c("The #Sun #Halo is out in full force today People need to look up once in", 
  "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange", 
  "Multiple warnings in effect for snow and wind with the latest #storm  Metro")

Answer 2

In base R, we can use regmatches/gregexpr:

regmatches(x, gregexpr("#\\S+", x))
#[[1]]
#[1] "#Sun"  "#Halo"

#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture"    "#FridaysFuture"      "#ClimateChange"     

#[[3]]
#[1] "#storm"
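The pattern "#\\S+" matches everything up to the next whitespace, whereas "#\\w+" stops at the first non-word character. On the cleaned sample data both give the same result, but they can differ when punctuation is still attached to a hashtag. A small sketch with a made-up string y (not part of the original data):

```r
# y is a hypothetical tweet with punctuation attached to the hashtags.
y <- "Big #storm, stay safe! #Metro."

# \S+ takes every non-space character, so the trailing punctuation is kept.
with_S <- regmatches(y, gregexpr("#\\S+", y))[[1]]

# \w+ takes only letters, digits and underscores, so the punctuation is dropped.
with_w <- regmatches(y, gregexpr("#\\w+", y))[[1]]

with_S
with_w
```

Which one is preferable depends on how thoroughly the text has been cleaned beforehand.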

If you also want to do it with gsub:

trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))

or

trimws(gsub("(^| )[A-Za-z]+\\b", "", x))

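Either call keeps only the words that start with #, separated by spaces. The second pattern removes purely alphabetic words only, so it relies on numbers and URLs having been stripped already.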
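Because gsub returns one string per tweet, the hashtags come back joined by spaces rather than as separate elements. If individual tags are needed, one possible follow-up (a sketch, with kept and tag_list as illustrative names) is to split the cleaned strings on whitespace:

```r
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
       "inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
       "Multiple warnings in effect for snow and wind with the latest #storm       Metro")

# Remove every word that is not preceded by "#", then trim leftover whitespace.
kept <- trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))

# Split each cleaned string on runs of whitespace to get one tag per element.
tag_list <- strsplit(kept, "\\s+")

kept[1]
tag_list[[2]]
```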

Data

x <- c("The #Sun #Halo is out in full force today People need to look up once in", 
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange", 
 "Multiple warnings in effect for snow and wind with the latest #storm       Metro"
 )
