Splitting a document from a tm corpus into multiple documents

Date: 2015-06-17 20:31:29

Tags: regex r split tm text-analysis

A bit of an odd question, but is there a way to split a document in a corpus imported with the Corpus function into multiple documents, which could then be read back into my corpus as separate documents? For example, if I use inspect(documents[1]) and have something like
`<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>`

`[[1]]`

`<<PlainTextDocument (metadata: 7)>>`

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

Hi mom

Purple is my favorite color

I want to split after this line!!!

Words

And stuff

I would like to split the document after the phrase "I want to split after this line!!!", which in this case occurs twice. Is that possible?

The end result, after using inspect(documents), should look like this:

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!

[[2]]
<<PlainTextDocument (metadata: 7)>>
Hi mom
Purple is my favorite color
I want to split after this line!!!

[[3]]
<<PlainTextDocument (metadata: 7)>>
Words
And stuff

2 Answers:

Answer 0 (score: 3)

You can split the document using strsplit and then recreate the corpus:

Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",
           fixed=TRUE)[[1]]))  ## use fixed=TRUE since the separator
                               ## contains special regex characters

To test this, let's first create a reproducible example:

documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")),collapse='\n')))

Then apply the solution above:

split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",   
           fixed=TRUE)[[1]]))  

Now inspect the result:

inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool


[[2]]
<<PlainTextDocument (metadata: 7)>>

Hi mom
Purple is my favorite color


[[3]]
<<PlainTextDocument (metadata: 7)>>

Words
And stuff

It looks like strsplit removes the separator :)
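That separator-dropping behavior is plain base R, so it can be checked without tm at all. A minimal sketch, using the text and delimiter from the question:

```r
# strsplit() with fixed = TRUE treats the delimiter as a literal string,
# so the "!!!" needs no regex escaping, and the delimiter itself is
# consumed: it does not appear in any of the resulting pieces.
text <- "The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff"

pieces <- strsplit(text, "I want to split after this line!!!", fixed = TRUE)[[1]]

length(pieces)                     # 3 pieces, one per future document
any(grepl("split after", pieces))  # FALSE: the delimiter is gone
trimws(pieces)                     # optionally strip leftover newlines
```

Note that strsplit always returns a list (one element per input string), which is why the answer indexes with [[1]] before handing the character vector to VectorSource.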

Answer 1 (score: 2)

Here is a simpler way, using the quanteda package:

require(quanteda)
segment(mytext, what = "other", delimiter = "I want to split after this line!!!")

This produces a list of length 1 (because it also works on multiple documents, if you have them), but if you just want a vector, you can always unlist() it:

[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n    \nHi mom\n\nPurple is my favorite color\n\n"                               
[3] "\n    \nWords\n\nAnd stuff" 

You can turn this back into a quanteda corpus using corpus(mytextSegmented), or read it back into a tm corpus for subsequent processing.
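A minimal sketch of that round trip back into tm, using the segmented output printed above as a literal character vector so it does not depend on quanteda's segment() (an older API; recent quanteda releases provide corpus_segment() / char_segment() instead):

```r
library(tm)

# The segmented pieces, as printed above by quanteda's segment()
segs <- list(c(
  "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n",
  "\n    \nHi mom\n\nPurple is my favorite color\n\n",
  "\n    \nWords\n\nAnd stuff"
))

# unlist() flattens the one-list-per-input structure, and VectorSource()
# turns each element of the character vector into a separate document
split.docs <- Corpus(VectorSource(unlist(segs)))
length(split.docs)  # 3 documents
```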