Removing non-English words from a corpus with the R tm package

Asked: 2016-01-16 07:22:06

Tags: r twitter encoding corpus non-english

I am trying to run n-grams on tweets extracted from Twitter.

Before doing that, I want to remove the non-English words from the corpus built with the packages below.

My code is as follows:

# Install and activate packages. Note that install.packages() takes a
# character vector of package names; passing them as separate arguments
# would install only the first one (the second argument is `lib`).
install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr", "RWeka", "tm"))
library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)
library(RWeka)
library(tm)


# Declare Twitter API Credentials
api_key <- "xxxx" # From dev.twitter.com
api_secret <- "xxxx" # From dev.twitter.com
token <- "xxxx" # From dev.twitter.com
token_secret <- "xxxx" # From dev.twitter.com

# Create Twitter Connection
setup_twitter_oauth(api_key, api_secret, token, token_secret)



# Run a Twitter search. Format: searchTwitter("Search Terms", n=100, lang="en",
# geocode="lat,lng"); it also accepts since and until.

tweets <- searchTwitter("'Chinese' OR 'chinese government' OR 'china government' OR 'china economic' OR 'chinese people'", n=10000, lang="en")
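
Retweets often repeat the same text verbatim, which can skew n-gram counts. As an optional extra step (not in the original code), the twitteR package's strip_retweets() can drop them before the conversion to a data frame:

# Optional: remove retweets so duplicated text does not dominate the n-grams
tweets <- strip_retweets(tweets, strip_manual = TRUE, strip_mt = TRUE)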

# Transform tweets list into a data frame
tweets.df <- twListToDF(tweets)

# Keep only the tweet text column and name it TEXT
data <- as.data.frame(tweets.df[, 1])
colnames(data)[1] <- "TEXT"

# Build a corpus from the data frame
data <- Corpus(DataframeSource(data))


docs <- data
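
As an aside, newer releases of tm (0.7 and later) changed DataframeSource() to require doc_id and text columns; under that assumption, the equivalent construction would be:

# Equivalent corpus construction for tm >= 0.7, where DataframeSource()
# expects doc_id and text columns
df <- data.frame(doc_id = seq_len(nrow(tweets.df)),
                 text = tweets.df$text,
                 stringsAsFactors = FALSE)
docs <- Corpus(DataframeSource(df))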




docs <- tm_map(docs, removePunctuation)                    # remove punctuation
docs <- tm_map(docs, removeNumbers)                        # remove numbers
docs <- tm_map(docs, content_transformer(tolower))         # convert to lowercase; base tolower
                                                           # must be wrapped in content_transformer()
docs <- tm_map(docs, removeWords, stopwords("english"))    # remove English stopwords
docs <- tm_map(docs, stemDocument)                         # strip common word endings (e.g. "ing", "es")
docs <- tm_map(docs, stripWhitespace)                      # collapse extra whitespace
docs <- tm_map(docs, PlainTextDocument)
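
None of the steps above removes emoji or other non-ASCII characters, which is what trips up the lowercasing later on. A minimal sketch of an extra cleaning step (my addition, not part of the original pipeline) deletes every non-ASCII character first:

# Sketch: strip all non-ASCII characters (removes emoji and other
# non-English symbols) before any further processing
removeNonASCII <- content_transformer(function(x) gsub("[^\x01-\x7F]", "", x))
docs <- tm_map(docs, removeNonASCII)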


# Despite its name, this tokenizer produces unigrams through trigrams (min = 1, max = 3)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
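
To illustrate what this tokenizer produces, here is a toy example (not from the original post):

# Returns every 1- to 3-word gram of the input, e.g. "china economic policy",
# "china economic", "economic policy", "china", "economic", "policy"
NGramTokenizer("china economic policy", Weka_control(min = 1, max = 3))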




# (header and fileEncoding are arguments to the read.* functions, not global
# options; the only global option relevant here is stringsAsFactors)
options(stringsAsFactors = FALSE)
tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
# note: this builds the matrix from the raw corpus 'data'; the cleaned 'docs' is never used

When I run this last step,

tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))

R gives me the following error:

> tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
Error in .tolower(txt) : 
  invalid input 'Chinese mentality í ½í¸‚' in 'utf8towcs'
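
The garbled 'í ½í¸‚' in the message is a mis-decoded emoji, exactly the kind of input that utf8towcs rejects. One possible fix (a sketch under the assumption that the offending characters are non-ASCII, not verified against this exact data) is to coerce the raw tweet text to ASCII before building the corpus:

# Sketch: convert tweet text to ASCII; sub = "" drops emoji and any other
# character that cannot be represented
tweets.df$text <- iconv(tweets.df$text, from = "UTF-8", to = "ASCII", sub = "")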

Does anyone know how I can modify my code to remove the non-English words?

Thanks

0 Answers:

There are no answers yet