Question

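I am new to R, but I need to do some text mining on Twitter. I am trying to clean the corpus so that it contains only UTF-8 characters. I use the function below to filter out non-UTF-8 characters.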

# set up with your own Twitter keys and access tokens
library(twitteR)
library(tm)

setup_twitter_oauth(consumer_key,consumer_secret,access_token,access_secret)
keyword = "#circulatieplan"
sinceDate = "2017-3-1"
tweets = searchTwitter(keyword,n = 300,lang = 'nl',since = sinceDate)
tweets_df = twListToDF(tweets)
tweets_df
View(tweets_df)

text = tweets_df$text
corpus = Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))

corpus_clean <- tm_map(corpus, tolower)

After that I try to convert everything to lowercase, but then I get an input error.

Error in FUN(content(x), ...) : invalid input 'Elke Sleurs gehoord op de radio. Dan viel Siegi precies nog mee. #schizo ��' in 'utf8towcs'

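My guess is that the filtering does not work perfectly and that the function cannot convert the leftover invalid characters to lowercase.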

I do not fully understand how the UTF-8 filtering works or what it actually does. Is there a better function, or how can I fix this error?

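Edit:
After looking at the raw data, I found that some tweets contain UTF characters that are longer than 2 bytes.
Tweet ID of a tweet with this problem: 858280532039397379
Data: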

"Elke Sleurs gehoord op de radio. Dan viel Siegi precies nog mee. #schizo \xed\xa0\xbd\xed\xb8\xb3\xed\xa0\xbd\xed\xb9\x84 #gent #circulatieplan",

Then I tried to remove them with a regular expression. Is the regex wrong, or can regular expressions not be used on a corpus object?

corpus <- tm_map(corpus, content_transformer(function(x) gsub(x, pattern = "(\\)\\w+", replacement = "")))
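
A note on the pattern above: the `\xed\xa0\xbd` sequences in the data are how R prints raw bytes, so the string contains no literal backslash for a regex to find, and `(\\)\\w+` also leaves its group unclosed. A minimal sketch of a byte-level pattern follows, assuming it is acceptable to drop everything outside the ASCII range; `strip_non_ascii` is a hypothetical helper name, not part of tm or base R.

```r
# Sample tweet text from the question; the \x.. escapes are raw bytes, not backslashes.
x <- "Elke Sleurs gehoord op de radio. Dan viel Siegi precies nog mee. #schizo \xed\xa0\xbd\xed\xb8\xb3\xed\xa0\xbd\xed\xb9\x84 #gent #circulatieplan"

# Hypothetical helper: remove every byte that is not in the ASCII range 0x01-0x7F.
# useBytes = TRUE makes gsub work byte by byte, so invalid UTF-8 input is not rejected.
strip_non_ascii <- function(s) {
  gsub("[^\\x01-\\x7F]", "", s, perl = TRUE, useBytes = TRUE)
}

cleaned <- strip_non_ascii(x)
print(cleaned)

# With only ASCII left, tolower() has nothing invalid to trip over.
print(tolower(cleaned))
```

On a corpus this would presumably be applied the same way as in the question, for example `tm_map(corpus, content_transformer(strip_non_ascii))`. Note that it also removes legitimate accented letters, which may matter for Dutch text.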

Answer 1

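I found a way to filter out the emoji. After a lot of searching, I found that there is a function that converts character vectors between encodings: iconv (see the iconv documentation).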

...
text = tweets_df$text    
# remove emoticons
text <- sapply(text,function(row) iconv(row, "latin1", "ASCII", sub=""))
corpus = Corpus(VectorSource(text))
...
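
To see what this `iconv` call does outside of tm, here is a small self-contained sketch in base R. The first string is the tweet from the question; the second string is a made-up example added only to illustrate a side effect. Behaviour of `sub` can differ between iconv implementations, so treat this as an illustration rather than a guarantee.

```r
# First element: tweet text from the question (raw bytes written as \x.. escapes).
# Second element: hypothetical example with accented letters, written with \u escapes.
text <- c(
  "Dan viel Siegi precies nog mee. #schizo \xed\xa0\xbd\xed\xb8\xb3 #gent",
  "Caf\u00e9 \u00e9\u00e9n"
)

# Treat every byte as a latin1 character and convert to ASCII.
# Bytes that have no ASCII equivalent are replaced by sub, here the empty string.
cleaned <- iconv(text, from = "latin1", to = "ASCII", sub = "")
print(cleaned)

# The cleaned text can now be lowercased without the 'utf8towcs' error.
print(tolower(cleaned))
```

Because the conversion targets ASCII, it removes more than emoji: accented letters such as those in the second string are dropped as well. If those need to be kept, a different cleaning step would be required.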

R, still weird characters after UTF-8 filtering
