Question

Hi, I need help removing words from the results I got from a Twitter search. This is the code I'm using.

library("twitteR")
library("ROAuth")
cred$handshake()
save(cred, file="twitter.Rdata")
load("twitter.Rdata")
registerTwitterOAuth(cred)
tweets = searchTwitter('#apple', n = 100, lang = "en")
tweets.df = twListToDF(tweets)
names(tweets.df)
tweets.df$text
tweet.words = strsplit(tweets.df$text, "[^A-Za-z]+")
word.table = table(unlist(tweet.words))
library("tm")
myStopwords <- c(stopwords('english'), "#apple","http://")
tweet.corpus = Corpus(VectorSource(tweets.df$text))
tweet.corpus = tm_map(tweet.corpus,function(x) iconv(x, to='UTF8', sub='byte'))
tweet.corpus = tm_map(tweet.corpus, PlainTextDocument)
tweet.corpus = tm_map(tweet.corpus,removeWords, myStopwords)
tweet.dtm = DocumentTermMatrix(tweet.corpus) 
tweet.matrix = inspect(tweet.dtm)

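The problem is that it does not remove the results containing #apple, or the website addresses containing http://, from the corpus. How can I remove these? Thanks for your help, Matt.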

Answer 1

The problem is that removeWords really wants to remove "words", not symbols. Under the hood it works through a regular expression like this:

function (x, words) 
gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")), 
    "", x, perl = TRUE)

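So the vector of words is collapsed with the regular expression | (or) operator, and those terms are then removed. Note that it wraps the match expression in \b to match a "word boundary", which is a zero-length match between a "word character" and a "non-word character". The problem with your terms is that # and / count as non-word characters, so the boundary does not match and those terms are not replaced.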
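The word-boundary behaviour described above can be checked in base R without tm. This is a minimal sketch: `remove_words_sketch` is a hypothetical stand-in that only mirrors the shape of the function quoted above, and the sample strings are made up for illustration.

```r
# Hypothetical stand-in with the same shape as the removeWords body quoted
# above: join the words with "|" and wrap the alternation in \b ... \b.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

x <- c("hashtag #apple", "two apples", "the apple")

# A plain word has a word boundary on both sides, so it is removed
# (but not inside "apples", where no boundary follows "apple").
plain <- remove_words_sketch(x, "apple")

# "#apple" is left alone: there is no word boundary between a space
# (or the start of the string) and "#", so the leading \b never matches.
hashtag <- remove_words_sketch(x, "#apple")

print(plain)
print(hashtag)
```

The second call returns its input unchanged, which matches what the question observes for #apple.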

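If you have to remove odd symbols, you are probably better off writing your own content transformer, so that the matching conditions are more explicit. For example: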

myremove <- content_transformer(function(x, ...) {
    gsub("(#apple\\b|\\bhttp://)","",x, perl=TRUE)
})

Then you can do:

tweets<-c("test one two", "two apples","hashtag #apple", "#apple #tree", "http://microsoft.com")

library("tm")   
tweet.corpus = Corpus(VectorSource(tweets))
tweet.corpus = tm_map(tweet.corpus,content_transformer(function(x) iconv(x, to='UTF8', sub='byte')))
tweet.corpus = tm_map(tweet.corpus,removeWords, stopwords('english'))
tweet.corpus = tm_map(tweet.corpus,myremove)
tweet.dtm = DocumentTermMatrix(tweet.corpus) 
inspect(tweet.dtm)

# <<DocumentTermMatrix (documents: 5, terms: 7)>>
# Non-/sparse entries: 8/27
# Sparsity           : 77%
# Maximal term length: 13
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs #tree apples hashtag microsoft.com one test two
#    1     0      0       0             0   1    1   1
#    2     0      1       0             0   0    0   1
#    3     0      0       1             0   0    0   0
#    4     1      0       0             0   0    0   0
#    5     0      0       0             1   0    0   0

So we simply add the extra transformation step, and we can see that those terms have been removed from the document-term matrix.
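Note that the transformer above strips only the `http://` prefix, so the rest of the address (`microsoft.com`) stays in the matrix. If the goal is to drop whole web addresses, one possible variation is sketched below in base R. The helper name `strip_urls_and_tag` and its `tag` argument are hypothetical, and the sketch assumes the tag contains no regex metacharacters.

```r
# Hypothetical helper (not part of tm): removes whole URLs and one hashtag.
strip_urls_and_tag <- function(x, tag = "#apple") {
  # Drop everything from http:// or https:// up to the next whitespace.
  x <- gsub("https?://\\S+", "", x, perl = TRUE)
  # Drop the hashtag itself; the trailing \b keeps longer tags such as
  # "#apples" intact. Assumes `tag` has no regex metacharacters.
  x <- gsub(paste0(tag, "\\b"), "", x, perl = TRUE, ignore.case = TRUE)
  # Collapse the whitespace left behind and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

tweets <- c("test one two", "two apples", "hashtag #apple",
            "#apple #tree", "http://microsoft.com")

cleaned <- strip_urls_and_tag(tweets)
print(cleaned)
```

To use it on a corpus, the function could be wrapped in `content_transformer()` and passed to `tm_map()` in the same way as `myremove`.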

Answer 2

A slightly different approach using qdap, which changes when the (#apple / URL) removal happens and how the corpus is constructed:

library(qdap); library(tm)

dat <- data.frame(
    person = paste0("person_", 1:5),
    tweets = c("test one two", "two apples","hashtag #apple", 
        "#apple #tree", "http://microsoft.com")
)

## remove specialty items
dat[["tweets"]] <- rm_default(dat[["tweets"]], pattern=pastex("@rm_url", "#apple\\b"))


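## build the corpus from the cleaned tweets, one document per person
## (assumption: the corpus-construction step is missing here; qdap's as.Corpus() is used to fill it)
myCorp <- with(dat, as.Corpus(tweets, person))
myCorp <- tm_map(myCorp, removeWords, stopwords("english"))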
myCorp %>% as.dtm() %>% tm::inspect()

## <<DocumentTermMatrix (documents: 5, terms: 7)>>
## Non-/sparse entries: 8/27
## Sparsity           : 77%
## Maximal term length: 13
## Weighting          : term frequency (tf)
## 
##           Terms
## Docs       #tree apples hashtag microsoft.com one test two
##   person_1     0      0       0             0   1    1   1
##   person_2     0      1       0             0   0    0   1
##   person_3     0      0       1             0   0    0   0
##   person_4     1      0       0             0   0    0   0
##   person_5     0      0       0             1   0    0   0

Unable to remove stop words from my corpus results in R
