Question

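I was hoping you could help with a text mining exercise. I am interested in AAPL tweets and was able to pull 500 tweets from the API. I cleared several hurdles on my own, but I need help with the last part. For some reason, the tm package is not removing stopwords. Could you take a look and see what the problem might be? Could emoticons be causing it?

After plotting Term_Frequency, the most frequent terms are "AAPL", "Apple", "iPhone", "Price", "Stock".

Thanks in advance!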

Munckinn

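library(twitteR)  # provides twListToDF
library(tm)       # provides VectorSource, VCorpus, tm_map, TermDocumentMatrix

#Transform into dataframe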
tweets.df <- twListToDF(tweets)

#Isolate text from tweets
aapl_tweets <- tweets.df$text

#Deal with emoticons
tweets2 <- data.frame(text = iconv(aapl_tweets, "latin1", "ASCII", "bye"), stringsAsFactors = FALSE)

#Make a vector source:
aapl_source <- VectorSource(tweets2)

#make a volatile corpus
aapl_corpus <- VCorpus(aapl_source)

#create my list to remove words
myList <- c("aapl", "apple", "stock", "stocks", stopwords("en"))

#clean corpus function 

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace, mc.cores = 1)
  corpus <- tm_map(corpus, removePunctuation, mc.cores = 1)
  corpus <- tm_map(corpus, removeWords, myList, mc.cores = 1)
  return(corpus)
}

#clean aapl corpus
aapl_cleaned <- clean_corpus(aapl_corpus)

#convert to TDM
aapl.tdm <- TermDocumentMatrix(aapl_cleaned)

aapl.tdm

#Convert as Matrix
aapl_m <- as.matrix(aapl.tdm)

#Create Frequency tables
term_frequency <- rowSums(aapl_m)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:10]

barplot(term_frequency[1:10])

Answer 1

I think your problem is in the iconv call: change "bye" to "byte".

   tweets2 <- data.frame(
              text = iconv(aapl_tweets, "latin1", "ASCII", "byte"),
              stringsAsFactors = FALSE)
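
To see why the value of `sub` matters, here is a minimal sketch using base R only. The sample tweet is made up for illustration: it is ASCII text followed by the four UTF-8 bytes of an emoji. Because the call declares the input as "latin1", each of those bytes is treated as a separate character that cannot be represented in ASCII, and `iconv` substitutes the `sub` string for each one. Exact results can vary with the platform's iconv implementation, so treat this as an illustration rather than a guarantee.

```r
# Hypothetical sample tweet: ASCII text plus the four UTF-8 bytes of an emoji.
tweet <- "AAPL up \xf0\x9f\x98\x80"

# sub = "bye": every non-convertible byte is replaced by the literal word "bye".
with_bye <- iconv(tweet, "latin1", "ASCII", sub = "bye")

# sub = "byte": every non-convertible byte is replaced by its hex code, such as <f0>.
with_byte <- iconv(tweet, "latin1", "ASCII", sub = "byte")

# sub = "": non-convertible bytes are simply dropped.
with_empty <- iconv(tweet, "latin1", "ASCII", sub = "")

print(c(with_bye, with_byte, with_empty))
```

With `sub = "bye"`, a single emoji typically turns into a run of repeated "bye" strings glued together, which then shows up in the corpus as an ordinary word. If the hex codes produced by `sub = "byte"` are not wanted either, `sub = ""` is an option that removes the bytes altogether.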

Stock tweets, text mining, emoticon errors
