How to remove meaningless words from a corpus?

Asked: 2019-07-02 03:41:38

Tags: r text nlp data-cleaning

I'm new to R and am trying to remove meaningless words from a corpus. I have a data frame in which one column contains emails and another contains the target variable, and I'm trying to clean the email body text. I've been using the tm and qdap packages for this. I've gone through most of the related questions and tried the example in Remove meaningless words from corpus in R. The problem is that when I try to remove the unwanted tokens (non-dictionary words) from the corpus, I get an error.

library(qdap)   # loads qdapDictionaries, which provides the GradyAugmented word list
library(tm)

# Build a corpus from the email bodies and normalise the text
corpus = Corpus(VectorSource(Email$Body))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, stripWhitespace)

# Stem each document
corpus = tm_map(corpus, stemDocument)

# Collect every term, then keep only those not found in the English word list
tdm = TermDocumentMatrix(corpus)
all_tokens = findFreqTerms(tdm, 1)
tokens_to_remove = setdiff(all_tokens, GradyAugmented)
corpus <- tm_map(corpus, content_transformer(removeWords), tokens_to_remove)

Running the lines above, I get the following error:

  invalid regular expression '(*UCP)\b(zyx|zyer|zxxxxxâ|zxxxxx|zwischenzeit|zwei|zvolen|zverejneni|zurã|zum|zstepswc|zquez|zprã|zorunlulu|zona|zoho|znis|zmir|zlf|zink|zierk|zhou|zhodnoteni|zgyã|zgã|zfs|zfbeswstat|zerust|zeroâ|zeppelinstr|zellerstrass|zeldir|zel|zdanska|zcfqc|zaventem|zarecka|zarardan|zaragoza|zaobchã|zamã|zakã|zaira|zahradnikova|zagorska|zagã|zachyti|zabih|zã|yusof|yukinobu|yui|ypg|ypaint|youtub|yoursid|youâ|yoshitada|yorkshir|yollayan|yokohama|yoganandam|yiewsley|yhlhjpz|yer|yeovil|yeni|yeatman|yazarina|yazaki|yaz|yasakt|yarm|yara|yannick|yanlislikla|yakar|yaiza|yabortslitem|yã|xxxxx|xxxxgbl|xuezi|xuefeng|xprn|xma|xlsx|xjchvnbbafeg|xiii|xii|xiaonan|xgb|xcede|wythenshaw|wys|wydzial|wydzia|wycomb|www|wuppert|wroclaw|wroc|wrightâ|wpisana|woustvil|wouldnâ|worthwhil|worsley|worri|worldwid|worldâ|workwear|worcestershir|worc|wootton|wooller|woodtec|woodsid|woodmansey|woodley|woodham|woodgat|wonâ|wolverhampton|wjodoyg|wjgfjiq|witti|witt|witkowski|wiss
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  PCRE pattern compilation error
    'regular expression is too large'
    at ''

A sample email from the corpus:

[794] "c mailto sent march ne rntbci accountspay nmuk subject new sig plc item still new statement await retriev use link connect account connect account link work copi past follow text address bar top internet browser https od datainterconnect com sigd sigdodsaccount php p thgqmdz d dt s contact credit control contact experi technic problem visit http bau faq datainterconnect com sig make payment call autom credit debit card payment line sig may abl help improv cashflow risk manag retent recoveri contract disput via www sigfinancetool co uk websit provid detail uniqu award win servic care select third parti avail sig custom power" 

tokens_to_remove[1:10]
 [1] "advis"        "appli"        "atlassian"    "bosch"        "boschrexroth" "busi"        
 [7] "communic"     "dcen"         "dcgbsom"      "email" 

I want to remove every word that has no meaning in English, e.g. c, mailto, ne, accountspay, nmuk, etc.
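The warning above shows why this fails: removeWords() pastes all the given words into one (*UCP)\b(word1|word2|...)\b alternation, and with thousands of tokens the compiled pattern exceeds PCRE's size limit. A common workaround is to remove the words in batches. A minimal sketch, assuming the corpus and tokens_to_remove objects built above; the batch size of 500 is an arbitrary choice:

# Split the removal list into batches of at most 500 words each
chunks <- split(tokens_to_remove,
                ceiling(seq_along(tokens_to_remove) / 500))

# Each call now compiles a pattern from at most 500 words,
# which stays within PCRE's pattern-size limit
for (chunk in chunks) {
  corpus <- tm_map(corpus, content_transformer(removeWords), chunk)
}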

1 Answer:

Answer 0: (score: 0)

I would proceed as follows:

library("readtext")
library(quanteda)
library(dplyr)
mytext<- c("Carles werwa went to sadaf buy trsfr in the supermanket", 
           "Marta needs to werwa sadaf go to Jamaica") # My corpus
tokens_to_remove<-c("werwa" ,"sadaf","trsfr")                         # My dictionary
TokenizedText<-tokens(mytext, 
                        remove_punct = TRUE, 
                        remove_numbers = TRUE)            # Tokenizing the words. You can input an english dictionary
mytextClean<- lapply(TokenizedText, function(x) setdiff(x, tokens_to_remove))          # setting the difference between both

mytextClean
$text1
[1] "Carles"      "went"        "to"          "buy"         "in"          "the"         "supermanket"

$text2
[1] "Marta"   "needs"   "to"      "go"      "Jamaica"

Note that tokens_to_remove could instead be an English dictionary; in that case, use intersect() rather than setdiff() to keep only the words found in the dictionary.
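For example, with the GradyAugmented English word list already used in the question (it ships with qdapDictionaries, which qdap loads), a minimal sketch of that dictionary variant; note that GradyAugmented is all lower-case, and that intersect(), like setdiff(), drops duplicate tokens:

library(quanteda)
library(qdapDictionaries)   # provides GradyAugmented

# Lower-case the tokens first, since GradyAugmented is all lower-case
toks <- tokens_tolower(TokenizedText)

# Keep only the tokens that appear in the English word list
mytextKept <- lapply(toks, function(x) intersect(x, GradyAugmented))

# To preserve duplicate tokens, filter rather than intersect:
# lapply(toks, function(x) x[x %in% GradyAugmented])

quanteda also has tokens_remove() and tokens_select() for this kind of filtering; they keep the result as a tokens object rather than a plain list.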