I am trying to run n-grams on tweets pulled from Twitter.
As part of this, I want to remove the non-English text from the corpus while using the packages below.
My code is as follows:
# Install and Activate Packages
install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr"))
library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)
library("RWeka")
library("tm")
# Declare Twitter API Credentials
api_key <- "xxxx" # From dev.twitter.com
api_secret <- "xxxx" # From dev.twitter.com
token <- "xxxx" # From dev.twitter.com
token_secret <- "xxxx" # From dev.twitter.com
# Create Twitter Connection
setup_twitter_oauth(api_key, api_secret, token, token_secret)
# Run Twitter Search. Format is searchTwitter("Search Terms", n=100, lang="en", geocode="lat,lng", also accepts since and until).
tweets <- searchTwitter("'Chinese' OR 'chinese goverment' OR 'china goverment' OR 'china economic' OR 'chinese people'", n=10000, lang="en")
# Transform tweets list into a data frame
tweets.df <- twListToDF(tweets)
data<-as.data.frame(tweets.df[,1])
colnames(data)
colnames(data)[1]<-"TEXT"
data<-Corpus(DataframeSource(data))
docs<-data
docs <- tm_map(docs, removePunctuation) # *Removing punctuation:*
docs <- tm_map(docs, removeNumbers) # *Removing numbers:*
docs <- tm_map(docs, tolower) # *Converting to lowercase:*
docs <- tm_map(docs, removeWords, stopwords("english")) # *Removing "stopwords"
docs <- tm_map(docs, stemDocument) # *Removing common word endings* (e.g., "ing", "es")
docs <- tm_map(docs, stripWhitespace) # *Stripping whitespace
docs <- tm_map(docs, PlainTextDocument)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3)) # despite the name, this produces 1- to 3-grams
options(header=TRUE, stringAsFactors=FALSE,fileencoding="latin1")
tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
When I run this last step, R gives me an error:
> tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
Error in .tolower(txt) :
invalid input 'Chinese mentality í ½í¸‚' in 'utf8towcs'
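I suspect the mangled emoji in that tweet is what trips up tolower() (I also notice my TermDocumentMatrix call uses the raw data corpus instead of the cleaned docs). One idea I am considering, though I am not sure it is the right fix, is to strip non-ASCII characters before anything lowercases the text; a minimal sketch, assuming tm's content_transformer() and base R's iconv() (the from = "latin1" encoding is a guess on my part):

# Possible fix (untested): drop characters that cannot be represented in
# ASCII, e.g. emoji, before any lowercasing; sub = "" removes them outright.
removeNonASCII <- content_transformer(function(x) iconv(x, from = "latin1", to = "ASCII", sub = ""))
docs <- tm_map(docs, removeNonASCII)
docs <- tm_map(docs, content_transformer(tolower))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))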
Does anyone know how I should modify my code to remove the non-English words, and whether the idea above is on the right track?
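Alternatively, would it make more sense to drop whole non-English tweets at the data-frame stage, before the corpus is built? A rough sketch of that idea, assuming the textcat package and that it labels English text as "english" (both assumptions on my part, untested):

# Hypothetical alternative (untested): keep only tweets detected as English,
# then rebuild the corpus from the filtered text.
# install.packages("textcat")
library(textcat)
txt <- tweets.df[, 1]                      # same text column as above
lang_guess <- textcat(txt)                 # e.g. "english", "german", ...
keep <- !is.na(lang_guess) & lang_guess == "english"
data <- data.frame(TEXT = txt[keep])
data <- Corpus(DataframeSource(data))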
Thanks.