I wrote a small piece of code to extract hashtags from tweets in R:
m<-c(paste("Hello! #London is gr8. #Wow"," ")) # My tweet
#m<- c("Hello! #London is gr8. #Wow")
x<- unlist(gregexpr("#(\\S+)",m))
#substring(m,x)[1]
subs<-function(x){
  return(substring(m,x+1,(x-2+regexpr(" |\\n",substring(m,x)[1]))))
}
tag<- sapply(x, subs)
#x
tag
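This code does not work unless I append a space to the end of the tweet. What could be the reason? I also tried matching \n, as in the pattern above.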
Answer 0 (score: 1)
gregexpr already gives you the information you need:
R> m<- c("Hello! #London is gr8. #Wow")
R> (x<- gregexpr("#(\\S+)",m)[[1]])
[1] 8 24
attr(,"match.length")
[1] 7 4
attr(,"useBytes")
[1] TRUE
So we can combine match.length with the start positions:
R> substring(m, x+1 , x - 1 + attr(x,"match.length"))
[1] "London" "Wow"
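As a possible shortcut, base R's regmatches() can pull the matched text straight out of the gregexpr() result, so the start positions and match.length do not have to be combined by hand. A minimal sketch, assuming the same tweet as above; the variable names hits and tags are only illustrative:

```r
m <- c("Hello! #London is gr8. #Wow")

# regmatches() returns one character vector of matches per input string
hits <- regmatches(m, gregexpr("#\\S+", m))[[1]]

# Drop the leading '#' from each match
tags <- sub("^#", "", hits)
tags
# [1] "London" "Wow"
```

Because regmatches() returns a list with one element per input string, the same call should also work on a vector of several tweets if the [[1]] is dropped.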
Answer 1 (score: 1)
Try this:
m <- c("Hello! #London is gr8. #Wow")
x <- unlist(strsplit(m, " "))
tag <- x[grep("^#", x)]
tag
Now, suppose you have a list of tweets like this:
m1 <- c("Hello! #London is gr8. #Wow")
m2 <- c("#Hello! #London is gr8. #Wow")
m3 <- c("#Hello! #London i#s gr8. #Wow")
m4 <- c("Hello! #London is gr8. #Wow ")
m <- list(m1, m2, m3, m4)
You can write a small function:
getTags <- function(tweet) {
  x <- unlist(strsplit(tweet, " "))
  tag <- x[grep("^#", x)]
  return(tag)
}
And apply it:
lapply(m, function(tweet) getTags(tweet))
[[1]]
[1] "#London" "#Wow"
[[2]]
[1] "#Hello!" "#London" "#Wow"
[[3]]
[1] "#Hello!" "#London" "#Wow"
[[4]]
[1] "#London" "#Wow"
An afterthought...
If you want to remove the hash (or strip all punctuation), the function should be
getTags <- function(tweet) {
  x <- unlist(strsplit(tweet, " "))
  tag <- x[grep("^#", x)]
  tag <- gsub("#", "", tag)
  return(tag)
}
or
getTags <- function(tweet) {
  x <- unlist(strsplit(tweet, " "))
  tag <- x[grep("^#", x)]
  tag <- gsub("[[:punct:]]", "", tag)
  return(tag)
}
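To make the difference between the two variants concrete, here is a small sketch that runs both on the tweet with a tag followed by punctuation. The names getTagsNoHash and getTagsNoPunct are hypothetical, introduced only so both versions can coexist in one session:

```r
# Hypothetical name: variant that removes only the '#' characters
getTagsNoHash <- function(tweet) {
  x <- unlist(strsplit(tweet, " "))
  tag <- x[grep("^#", x)]
  gsub("#", "", tag)
}

# Hypothetical name: variant that removes every punctuation character
getTagsNoPunct <- function(tweet) {
  x <- unlist(strsplit(tweet, " "))
  tag <- x[grep("^#", x)]
  gsub("[[:punct:]]", "", tag)
}

m2 <- c("#Hello! #London is gr8. #Wow")
getTagsNoHash(m2)
# [1] "Hello!" "London" "Wow"
getTagsNoPunct(m2)
# [1] "Hello"  "London" "Wow"
```

The punctuation variant also drops characters inside a tag (an underscore, for example), so which one fits probably depends on how the tags are used afterwards.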
Answer 2 (score: 0)
$ matches the end of the string.
m<- c("Hello! #London is gr8. #Wow")
subs<-function(x){
  return(substring(m,x+1,(x-2+regexpr(" |$",substring(m,x)[1]))))
}
The rest of the code stays the same:
> tag
[1] "London" "Wow"
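A short sketch of why the original pattern appears to fail on the last tag: regexpr() returns -1 when nothing matches, and the text after the final "#" contains neither a space nor a newline, so the computed end position lands before the start position. The variable names below (tail_part, no_match, with_end) are only illustrative:

```r
m <- c("Hello! #London is gr8. #Wow")
x <- unlist(gregexpr("#(\\S+)", m))   # start positions: 8 24

# Text from the last '#' to the end of the tweet
tail_part <- substring(m, x[2])       # "#Wow"

# Original pattern: no space or newline follows the tag, so no match
no_match <- as.integer(regexpr(" |\\n", tail_part))   # -1

# With '$' the end of the string matches, one position past the last character
with_end <- as.integer(regexpr(" |$", tail_part))     # 5

substring(m, x[2] + 1, x[2] - 2 + no_match)   # "" (end position before start)
substring(m, x[2] + 1, x[2] - 2 + with_end)   # "Wow"
```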