tm package functions not removing quotation marks and hyphens when cleaning a corpus

Date: 2015-06-23 05:04:02

Tags: r text-mining tm

I am trying to clean a corpus, and I have used the typical steps, as in the code below:

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)

However, when I inspect the matrix, there are a number of terms that still carry quotation marks or leading symbols, for example: “we”, “company”, “code guidelines”, -known, -accelerated.

It seems the words themselves are wrapped in curly quotation marks, but when I run the removePunctuation step again, it has no effect. There are also some words preceded by bullet characters that I cannot remove either.

Any help is much appreciated.

3 Answers:

Answer 0 (score: 10):

removePunctuation uses gsub('[[:punct:]]', '', x), i.e. it removes the ASCII symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. To remove other symbols, such as typographic (curly) quotes or bullet characters (or anything else), declare your own transformation function:

removeSpecialChars <- function(x) gsub("[“”•]", "", x)
docs <- tm_map(docs, removeSpecialChars)
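As a quick check, the character class above removes each listed character wherever it occurs (a small sketch; the sample tokens are illustrative, written with Unicode escapes for the curly quotes and bullet):

```r
# Character class covering left/right curly quotes and the bullet.
removeSpecialChars <- function(x) gsub("[\u201c\u201d\u2022]", "", x)

removeSpecialChars(c("\u201cwe\u201d", "\u2022company"))
# returns c("we", "company")
```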

Or you can go further and remove everything that is not an alphanumeric character or a space:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, removeSpecialChars)
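Applied to the stray tokens from the question (illustrative strings), this catches the curly quotes, bullets, and leading hyphens in one pass:

```r
# Keep only ASCII letters, digits, and spaces; drop everything else.
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

removeSpecialChars("\u201cwe\u201d \u2022company -known")
# returns "we company known"
```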

Answer 1 (score: 1):

A better-constructed tokenizer will handle this automatically. Try this:

> require(quanteda)
> text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
> toktexts <- tokenize(toLower(text), removePunct = TRUE, removeNumbers = TRUE)
> toktexts
[[1]]
[1] "enjoying" "my"       "time"    

[[2]]
[1] "single" "air"    "quotes"

attr(,"class")
[1] "tokenizedTexts" "list"          
> dfm(toktexts, stem = TRUE, ignoredFeatures = stopwords("english"), verbose = FALSE)
Creating a dfm from a tokenizedTexts object ...
   ... indexing 2 documents
   ... shaping tokens into data.table, found 6 total tokens
   ... stemming the tokens (english)
   ... ignoring 174 feature types, discarding 1 total features (16.7%)
   ... summing tokens by document
   ... indexing 5 feature types
   ... building sparse matrix
   ... created a 2 x 5 sparse dfm
   ... complete. Elapsed time: 0.016 seconds.
Document-feature matrix of: 2 documents, 5 features.
2 x 5 sparse Matrix of class "dfmSparse"
       features
docs    air enjoy quot singl time
  text1   0     1    0     0    1
  text2   1     0    1     1    0

Answer 2 (score: 0):

@cyberj0g's answer needs a minor modification for recent versions of tm (0.6), where custom functions must be wrapped in content_transformer(). The updated code can be written as follows:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))
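For context, a minimal sketch of where this step fits in the question's pipeline (assumes tm >= 0.6 is installed; the VectorSource and sample text are stand-ins for the asker's DirSource corpus):

```r
library(tm)

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

# Stand-in corpus; the asker used Corpus(DirSource(path)) instead.
docs <- Corpus(VectorSource("\u201cWe\u201d \u2022company -known 2015"))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
# Strip curly quotes, bullets, hyphens, and other punctuation in one pass.
docs <- tm_map(docs, content_transformer(removeSpecialChars))
docs <- tm_map(docs, stripWhitespace)

content(docs[[1]])
```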

Thanks to @cyberj0g for the working code.