Subset documents based on keywords

Time: 2017-04-20 04:35:04

Tags: r tm

I have a large dataset containing a lot of text. I want to subset the documents using keywords.

Here is my attempt:

t <- data.frame(T1 = c("A01", "A02", "A03", "A04", "A05", "A06"), T2 = c("Fargo is my bestest", 
"I like to read Mavis Gallant", "Read write and Fargo", "One flew over the cuckoo nest", "Bubba gump","Mariana marries Maris"))
t

   T1                            T2
1 A01           Fargo is my bestest
2 A02  I like to read Mavis Gallant
3 A03          Read write and Fargo
4 A04 One flew over the cuckoo nest
5 A05                    Bubba gump
6 A06         Mariana marries Maris


library(tm)
mydata.corpus <- Corpus(VectorSource(t$T2))
mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')), mc.cores=1)
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower), mc.cores=1) 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE, mc.cores=1) 
my_stopwords <- c("Gallant", "Bestest", "One", "flew", "cuckoo", "Bubba", "Mariana")
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords, mc.cores=1)
mydata.corpus <- tm_map(mydata.corpus, removeNumbers, mc.cores=1) 

mydata.dtm <- TermDocumentMatrix(mydata.corpus) 
mydata.dtm

dim(mydata.dtm)
inspect(mydata.dtm)

I want to subset the documents whose text contains "fargo". According to the term-document matrix, docs 1 and 3 contain the word 'fargo'.

         Docs
Terms     1 2 3 4 5 6
  and     0 0 1 0 0 0
  bestest 1 0 0 0 0 0
  bubba   0 0 0 0 1 0
  fargo   1 0 1 0 0 0
  gallant 0 1 0 0 0 0
  gump    0 0 0 0 1 0
  like    0 1 0 0 0 0
  mariana 0 0 0 0 0 1
  maris   0 0 0 0 0 1
  marries 0 0 0 0 0 1
  mavis   0 1 0 0 0 0
  nest    0 0 0 1 0 0
  one     0 0 0 1 0 0
  over    0 0 0 1 0 0
  read    0 1 1 0 0 0
  the     0 0 0 1 0 0
  write   0 0 1 0 0 0

I would like some code that produces output like this:

   T1                   T2
1 A01  Fargo is my bestest
2 A03 Read write and Fargo
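
One possible way to get that output, sketched here under a few assumptions: it filters the original data frame directly with base R's grepl() rather than going through the tm corpus, it matches keywords as whole words regardless of case, and the helper name subset_by_keywords is made up for illustration (it is not part of tm). The data frame is called docs here instead of t, only to avoid masking base R's t().

```r
# Sample data from the question; stringsAsFactors = FALSE keeps T2 as character.
docs <- data.frame(
  T1 = c("A01", "A02", "A03", "A04", "A05", "A06"),
  T2 = c("Fargo is my bestest", "I like to read Mavis Gallant",
         "Read write and Fargo", "One flew over the cuckoo nest",
         "Bubba gump", "Mariana marries Maris"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a tm function): keep the rows whose text column
# contains at least one keyword as a whole word, ignoring case.
# Keywords are pasted into a regular expression, so regex metacharacters
# in a keyword would need escaping first.
subset_by_keywords <- function(df, text_col, keywords) {
  pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
  hits <- grepl(pattern, df[[text_col]], ignore.case = TRUE, perl = TRUE)
  out <- df[hits, , drop = FALSE]
  rownames(out) <- NULL  # renumber rows from 1, as in the desired output
  out
}

result <- subset_by_keywords(docs, "T2", "fargo")
print(result)
```

With the sample data, result should hold the A01 and A03 rows. Passing several keywords, e.g. c("fargo", "gump"), keeps any row that matches at least one of them.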

0 answers:

No answers