I have a large dataset containing a lot of text. I want to subset the documents by keyword.
Here is my attempt:
t <- data.frame(T1 = c("A01", "A02", "A03", "A04", "A05", "A06"), T2 = c("Fargo is my bestest",
"I like to read Mavis Gallant", "Read write and Fargo", "One flew over the cuckoo nest", "Bubba gump","Mariana marries Maris"))
t
T1 T2
1 A01 Fargo is my bestest
2 A02 I like to read Mavis Gallant
3 A03 Read write and Fargo
4 A04 One flew over the cuckoo nest
5 A05 Bubba gump
6 A06 Mariana marries Maris
library(tm)
mydata.corpus <- Corpus(VectorSource(t$T2))
mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')), mc.cores=1)
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower), mc.cores=1)
mydata.corpus <- tm_map(mydata.corpus, removePunctuation, preserve_intra_word_dashes=TRUE, mc.cores=1)
my_stopwords <- c("Gallant", "Bestest", "One", "flew", "cuckoo", "Bubba", "Mariana")
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords, mc.cores=1)
mydata.corpus <- tm_map(mydata.corpus, removeNumbers, mc.cores=1)
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm
dim(mydata.dtm)
inspect(mydata.dtm)
I want the subset of documents whose text contains fargo. From the matrix, it looks like doc 1 and doc 3 contain the word 'fargo'.
Docs
Terms 1 2 3 4 5 6
and 0 0 1 0 0 0
bestest 1 0 0 0 0 0
bubba 0 0 0 0 1 0
fargo 1 0 1 0 0 0
gallant 0 1 0 0 0 0
gump 0 0 0 0 1 0
like 0 1 0 0 0 0
mariana 0 0 0 0 0 1
maris 0 0 0 0 0 1
marries 0 0 0 0 0 1
mavis 0 1 0 0 0 0
nest 0 0 0 1 0 0
one 0 0 0 1 0 0
over 0 0 0 1 0 0
read 0 1 1 0 0 0
the 0 0 0 1 0 0
write 0 0 1 0 0 0
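The observation that docs 1 and 3 contain 'fargo' can also be read off the matrix programmatically. The following is only a sketch, assuming the tm package is installed and that document order in the corpus matches row order in the data frame; the names docs, corpus, tdm, counts and hits are made up for this illustration (docs is used instead of t to avoid masking base R's t()).

```r
library(tm)

# Same sample data as in the question
docs <- data.frame(
  T1 = c("A01", "A02", "A03", "A04", "A05", "A06"),
  T2 = c("Fargo is my bestest", "I like to read Mavis Gallant",
         "Read write and Fargo", "One flew over the cuckoo nest",
         "Bubba gump", "Mariana marries Maris"),
  stringsAsFactors = FALSE
)

# Build a lowercased corpus and its term-document matrix
corpus <- Corpus(VectorSource(docs$T2))
corpus <- tm_map(corpus, content_transformer(tolower))
tdm <- TermDocumentMatrix(corpus)

# Row "fargo" of the matrix holds the per-document counts.
# Note: this lookup fails if the term is not in the matrix at all.
counts <- as.matrix(tdm)["fargo", ]

# Positions of the documents where the term occurs at least once
hits <- unname(which(counts > 0))

# Use those positions to index the original data frame
docs[hits, ]
```

Converting with as.matrix() is fine for a small example like this one, but on a large corpus the dense matrix may not fit in memory.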
I would like some code that produces output like this:
T1 T2
1 A01 Fargo is my bestest
2 A03 Read write and Fargo
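One possible way to get there without going through the term-document matrix, offered as a sketch rather than a definitive answer, is to match the keyword directly against the text column with base R's grepl(). The names docs, keyword, pattern and result are made up for this illustration, and the whole-word, case-insensitive match is an assumption about what "contains fargo" should mean.

```r
# Same sample data as in the question
docs <- data.frame(
  T1 = c("A01", "A02", "A03", "A04", "A05", "A06"),
  T2 = c("Fargo is my bestest", "I like to read Mavis Gallant",
         "Read write and Fargo", "One flew over the cuckoo nest",
         "Bubba gump", "Mariana marries Maris"),
  stringsAsFactors = FALSE
)

keyword <- "fargo"

# \\b anchors the match to word boundaries, so the keyword is not
# matched inside a longer word
pattern <- paste0("\\b", keyword, "\\b")

# Keep only the rows whose text matches, ignoring case
result <- docs[grepl(pattern, docs$T2, ignore.case = TRUE, perl = TRUE), ]

# Renumber the rows so they start from 1
rownames(result) <- NULL

print(result)
```

With the sample data this keeps the rows for A01 and A03. For several keywords, the pattern could be built with paste(keywords, collapse = "|") wrapped in a group, assuming the keywords contain no regex metacharacters.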