Removing rows from a Corpus with multiple documents

Time: 2016-01-07 02:11:41

Tags: r tm

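I have 4000 text documents in a corpus. As part of data cleaning, I want to remove from each document the lines that contain a specific word.

For example: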

library(tm)
doc.corpus<-  VCorpus(DirSource("C:\\TextMining\\Prototype",pattern="*.txt",encoding= "UTF8",mode = "text"),readerControl=list(language="en"))

doc.corpus<- tm_map(doc.corpus, PlainTextDocument)

doc.corpus[[1]]

#PlainTextDocument
Metadata:  7
Content:  chars: 16542

    as.character(doc.corpus)[[1]]


$content


"Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities."
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation."

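My problem is to remove the second line, which contains the word "trademark", from this and all other documents. Currently I use the grepl() function to identify the lines, and I tried to exclude them with the approach I normally use when working with data frames, which does not work: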

corpus.copy<-corpus.doc
corpus.doc[[1]]<-corpus.copy[[1]][!grepl("trademark",as.character(corpus.copy[[1]]),ignore.case = TRUE),]

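As long as it works for the first document, I can easily use a "for loop" to apply it to all documents in the Corpus.

Any hints/solutions are appreciated. As an alternative route, I could easily convert the corpus to a data frame, remove the unwanted rows, and convert it back to a corpus. Thanks.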

System.info:
[1] "x86_64-w64-mingw32"; 
[1] "R version 3.1.0 (2014-04-10)"
[1] tm_0.6-2 

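2 Answers:

Answer 0: (score: 1)

There is no need for a for loop, although it has long been a frustrating feature of tm that texts are hard to access once they are inside a corpus object.

I have interpreted "row" to mean a document, so the example above is two "rows". If that is not the case, this answer needs to be adjusted (but that is easy to do).

Try this: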

txt <- c("Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities.",
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation.")

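library(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
# !grepl() keeps the non-matching documents; unlike -grep(), it does not
# drop everything when no document matches the pattern
newCorp <- VCorpus(VectorSource(textVector[!grepl("trademark", textVector,
                                                  ignore.case = TRUE)]))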

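newCorp now excludes the documents containing "trademark". Note that this pattern also matches the plural "trademarks"; if you do not want the plural to match, the regular expression needs to be made stricter.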
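As a minimal sketch of the point about plurals, the base R snippet below compares a plain substring pattern with one that uses word boundaries. The sample strings are made up for illustration, and `perl = TRUE` is just one way to get `\b` support.

```r
# Made-up sample lines for illustration only
x <- c("a registered trademark of X",
       "registered trademarks of Y",
       "no legal boilerplate here")

# Plain substring match: hits both the singular and the plural
loose <- grepl("trademark", x, ignore.case = TRUE)

# Word boundaries on both sides: hits the singular only
strict <- grepl("\\btrademark\\b", x, ignore.case = TRUE, perl = TRUE)

print(loose)
print(strict)
```

Either logical vector can be negated with `!` and used to subset the text vector before it is rebuilt into a corpus.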

Answer 1: (score: 0)

Thank you Ken. Below is the small modification I made for my successful implementation.

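    require(tm)
    corp <- VCorpus(VectorSource(txt))
    # lapply() always returns a list, one character vector per document
    textVector <- lapply(corp, as.character)
    # Copy once, outside the loop, so earlier iterations are not overwritten
    newCorp <- textVector
    for (j in seq_along(textVector)) {
      # Keep only the lines of document j that do not mention "trademark"
      # (the pattern also matches "trademarks")
      keep <- !grepl("trademark", textVector[[j]], ignore.case = TRUE)
      newCorp[[j]] <- textVector[[j]][keep]
    }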

It seems 'textVector' contains a 'list' of documents. A 'for' loop is still needed.
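For line-level removal, the loop can probably be replaced by `lapply()`. The sketch below uses base R only and assumes each document is available as a character vector of lines; `drop_matching_lines` and the sample `docs` list are hypothetical names introduced here, not part of tm.

```r
# Hypothetical helper (not part of tm): drop the lines of one document
# that match `pattern`, ignoring case
drop_matching_lines <- function(lines, pattern = "trademark") {
  lines[!grepl(pattern, lines, ignore.case = TRUE)]
}

# Stand-in for a corpus: one character vector of lines per document
docs <- list(
  doc1 = c("Quick to deploy, easy to use.",
           "Microsoft is a U.S. registered trademark of Microsoft Corporation."),
  doc2 = c("No legal boilerplate in this one.")
)

# Apply the helper to every document; documents without a match are unchanged
cleaned <- lapply(docs, drop_matching_lines)
print(cleaned)
```

If the corpus was read with DirSource so that each document's content is a character vector of lines, the same helper could presumably be applied in place with `tm_map(doc.corpus, content_transformer(drop_matching_lines))`. That call is untested against tm 0.6-2 and should be checked on a single document first.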