I want to extract only the hashtags from tweets using gsub. For example:
sentence = tweet_text$text
The result is
"The #Sun #Halo is out in full force today People need to look up once in awhile to see",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro"
What I want to get is just #Sun, #Halo from the first one; #YouthStrikeClimate, #Friday~~ from the second one; and #storm from the last one.
I tried to do this with:
sentence = gsub("^(?!#)","",sentence,perl = TRUE) or
sentence1 = gsub("[^#\\w+]","",sentence,perl = TRUE)
or something along those lines. I have already removed useless tokens such as numbers, http:// and so on.
How can I extract the hashtags using gsub?
Answer 0 (score: 2)
We can use str_extract_all from stringr and extract every run of word characters that is preceded by a hash (#).
stringr::str_extract_all(x, '#\\w+')
#[[1]]
#[1] "#Sun" "#Halo"
#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture" "#FridaysFuture" "#ClimateChange"
#[[3]]
#[1] "#storm"
A base R approach with minimal regex: we split each string on whitespace and keep only the words for which startsWith finds a leading #.
sapply(strsplit(x, "\\s+"), function(p) p[startsWith(p, "#")])
Data
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro")
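One caveat with sapply, shown as a small sketch (same_len and pick_tags are made-up names for illustration, not from the question): when every tweet yields the same number of hashtags, sapply simplifies the result to a vector or matrix instead of a list. lapply always returns a list, so it may be the safer choice if later code expects one element per tweet.

```r
# Hypothetical input: each string contains exactly one hashtag.
same_len <- c("#a one", "#b two")

# Helper (illustrative name): keep the tokens that begin with "#".
pick_tags <- function(p) p[startsWith(p, "#")]

# sapply() simplifies equal-length results into a plain character vector.
simplified <- sapply(strsplit(same_len, "\\s+"), pick_tags)

# lapply() keeps one list element per input string, whatever the lengths.
as_list <- lapply(strsplit(same_len, "\\s+"), pick_tags)
```

With the data from the question the lengths differ (2, 4 and 1), so sapply happens to return a list there as well.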
Answer 1 (score: 1)
In base R, we can use regmatches/gregexpr
regmatches(x, gregexpr("#\\S+", x))
#[[1]]
#[1] "#Sun" "#Halo"
#[[2]]
#[1] "#YouthStrikeClimate" "#FridayForFuture" "#FridaysFuture" "#ClimateChange"
#[[3]]
#[1] "#storm"
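A note on the pattern, as a hedged sketch: "#\\S+" takes everything up to the next whitespace, so punctuation glued to a hashtag comes along with it, while "#\\w+" stops at the first non-word character. The tweet string below is invented for illustration (the question says the text was already cleaned, so this may not matter there).

```r
# Hypothetical tweet with punctuation attached to the hashtags.
tweet <- "Storm warning #storm, stay safe #Metro!"

# "#\\S+" matches up to the next whitespace, punctuation included.
greedy <- regmatches(tweet, gregexpr("#\\S+", tweet))[[1]]

# "#\\w+" matches only letters, digits and underscores after the hash.
strict <- regmatches(tweet, gregexpr("#\\w+", tweet))[[1]]
```

Here greedy holds "#storm," and "#Metro!", whereas strict holds "#storm" and "#Metro".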
As for doing it with gsub:
trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))
or
trimws(gsub("(^| )[A-Za-z]+\\b", "", x))
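Both keep the words that start with #, separated by a space, so each tweet is reduced to a single string of hashtags.
Data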
x <- c("The #Sun #Halo is out in full force today People need to look up once in",
"inspired #YouthStrikeClimate #FridayForFuture #FridaysFuture #ClimateChange",
"Multiple warnings in effect for snow and wind with the latest #storm Metro"
)
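Since the gsub route returns one string per tweet, a possible follow-up (a sketch only; the shortened tweets and the names kept, tags and joined are assumptions for illustration) is to split that string into individual hashtags, or to join them with commas as in the desired output of the question.

```r
# Shortened, hypothetical versions of the tweets from the question.
x <- c("The #Sun #Halo is out in full force today", "latest #storm Metro")

# Remove every token that does not directly follow a "#", then trim.
kept <- trimws(gsub("(?<!#)\\b\\S+\\s*", "", x, perl = TRUE))

# One character vector of hashtags per tweet.
tags <- strsplit(kept, " ", fixed = TRUE)

# Comma-separated form, one string per tweet.
joined <- gsub(" ", ", ", kept, fixed = TRUE)
```

For these two inputs kept is "#Sun #Halo" and "#storm", and joined is "#Sun, #Halo" and "#storm".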