I am trying to do some text mining with Twitter data. This is what I did:
#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
lats=c(38.9,40.7)
lons=c(-77,-74)
roger=do.call(rbind,lapply(1:length(lats), function(i) searchTwitter('Roger+Federer',
lang="en",n=N,resultType="recent",
geocode=paste(lats[i],lons[i],paste0(S,"mi"),sep=","))))
This all works fine, but when I try to apply the tolower function to a corpus built with the tm package:
data=as.data.frame(cbind(tweet=rogertext))
corpus=Corpus(VectorSource(data$tweet))
corpus=tm_map(corpus,tolower)
it throws this error:
> corpus=tm_map(corpus,tolower)
Error in FUN(X[[i]], ...) :
invalid input 'RT @Federerism: Roger Federer reaches 5 million followers on twitter Love You Roger í ½í¸˜ í ½í¸ í ½í¸˜ í ½í¸ #Roger #Federer # Federerism #Maestro https:/…' in 'utf8towcs'
Any ideas what is going wrong?
Answer 0 (score: 2)
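base::tolower chokes on special characters such as the emoji in this tweet. This is a common problem when mining tweets. You could try to catch the error, or just use stringi's counterpart to tolower: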
# tw <- searchTwitter('Roger Federer reaches 5 million followers on twitter Love You Roger', n=1)
download.file("https://www.dropbox.com/s/33ilhcu2v82nwuq/twitter_tolower.rda?dl=1", tf <- tempfile(fileext = ".rda"), mode="wb")
load(tf)
tw[[1]]$getText()
# [1] "RT @Federerism: Roger Federer reaches 5 million followers on twitter Love You Roger \xed��\xed�\u0098 \xed��\xed�\u008d \xed��\xed�\u0098 \xed��\xed�\u008d #Roger #Federer # Federerism #Maestro https:/…"
## Does not work:
tolower(tw[[1]]$getText())
# Error in tolower(tw[[1]]$getText()) :
# invalid input 'RT @Federerism: Roger Federer reaches 5 million followers on twitter Love You Roger í ½í¸˜ í ½í¸ í ½í¸˜ í ½í¸ #Roger #Federer # Federerism #Maestro https:/…' in 'utf8towcs'
## Works:
stringi::stri_trans_tolower(tw[[1]]$getText())
# [1] "rt @federerism: roger federer reaches 5 million followers on twitter love you roger \xed��\xed�\u0098 \xed��\xed�\u008d \xed��\xed�\u0098 \xed��\xed�\u008d #roger #federer # federerism #maestro https:/…"
## Works, too:
library(tm)
corp <- Corpus(VectorSource(tw[[1]]$getText()))
corp <- tm_map(corp, content_transformer(stringi::stri_trans_tolower))
content(corp[[1]])
# [1] "rt @federerism: roger federer reaches 5 million followers on twitter love you roger \xed��\xed�\u0098 \xed��\xed�\u008d \xed��\xed�\u0098 \xed��\xed�\u008d #roger #federer # federerism #maestro https:/…"
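The "catch the error" route mentioned at the top of this answer could look roughly like the sketch below. `safe_tolower` is a hypothetical helper name (it is not part of base R, tm, or stringi), and returning NA for strings that fail is just one possible choice; dropping or cleaning those tweets would work as well.

```r
# Hypothetical helper (an assumption for illustration, not a library function):
# lower-case each element and fall back to NA when base::tolower() throws,
# e.g. on a string that is not valid in the current encoding.
safe_tolower <- function(x) {
  vapply(x, function(s) {
    tryCatch(tolower(s), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

# Well-formed strings are lower-cased as usual.
tweets <- c("Hello World", "Roger FEDERER")
safe_tolower(tweets)
```

It should also be possible to pass it to tm via `tm_map(corp, content_transformer(safe_tolower))`, but unlike `stringi::stri_trans_tolower` it loses the text of any tweet that fails, so the stringi version is probably the better default.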
Answer 1 (score: 0)
Try the following:
corpus <- tm_map(corpus, content_transformer(tolower))
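The syntax in the tm package changed a few years ago: plain functions such as tolower now need to be wrapped in content_transformer. Hope this solves the problem.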