Splitting a document from a tm corpus into multiple documents

Date: 2015-06-17 20:31:29

Tags: regex r split tm text-analysis

A bit of an odd question, but is there a way to split a document in a corpus imported with the Corpus function into multiple documents, which could then be read back into my corpus as separate documents? For example, if I use inspect(documents[1]) and have something like
`<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>`

`[[1]]`

`<<PlainTextDocument (metadata: 7)>>`

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

Hi mom

Purple is my favorite color

I want to split after this line!!!

Words

And stuff

I would like to split the document after the phrase "I want to split after this line!!!", which in this case occurs twice. Is that possible?

The end result, after using inspect(documents), should look like this:

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!

[[2]]
<<PlainTextDocument (metadata: 7)>>
Hi mom
Purple is my favorite color
I want to split after this line!!!

[[3]]
<<PlainTextDocument (metadata: 7)>>
Words
And stuff

2 Answers:

Answer 0 (score: 3)

You can split the document using strsplit and then recreate the corpus:

Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",
           fixed=TRUE)[[1]]))  ## use fixed=TRUE since the separator
                               ## contains special regex characters

To test this, let's first create a reproducible example:

documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")),collapse='\n')))

Then apply the solution above:

split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",   
           fixed=TRUE)[[1]]))  

Now inspect the result:

inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool


[[2]]
<<PlainTextDocument (metadata: 7)>>

Hi mom
Purple is my favorite color


[[3]]
<<PlainTextDocument (metadata: 7)>>

Words
And stuff

It looks like strsplit removes the separator :)
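That separator-dropping behavior is plain base R, so it can be checked without tm at all. A minimal sketch, using the text and delimiter from the question:

```r
# strsplit() with fixed = TRUE treats the delimiter as a literal string,
# so the "!!!" needs no regex escaping, and the delimiter itself is
# consumed: it does not appear in any of the resulting pieces.
text <- "The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff"

pieces <- strsplit(text, "I want to split after this line!!!", fixed = TRUE)[[1]]

length(pieces)                     # 3 pieces, one per future document
any(grepl("split after", pieces))  # FALSE: the delimiter is gone
trimws(pieces)                     # optionally strip leftover newlines
```

Note that strsplit always returns a list (one element per input string), which is why the answer indexes with [[1]] before handing the character vector to VectorSource.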

Answer 1 (score: 2)

Here is a simpler way, using the quanteda package:

require(quanteda)
segment(mytext, what = "other", delimiter = "I want to split after this line!!!")

This produces a list of length 1 (because it also works on multiple documents, if you have them), but if you just want a vector, you can always unlist() it:

[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n    \nHi mom\n\nPurple is my favorite color\n\n"                               
[3] "\n    \nWords\n\nAnd stuff" 

You can turn this back into a quanteda corpus using corpus(mytextSegmented), or read it back into a tm corpus for subsequent processing.
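A minimal sketch of that round trip back into tm, using the segmented output printed above as a literal character vector so it does not depend on quanteda's segment() (an older API; recent quanteda releases provide corpus_segment() / char_segment() instead):

```r
library(tm)

# The segmented pieces, as printed above by quanteda's segment()
segs <- list(c(
  "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n",
  "\n    \nHi mom\n\nPurple is my favorite color\n\n",
  "\n    \nWords\n\nAnd stuff"
))

# unlist() flattens the one-list-per-input structure, and VectorSource()
# turns each element of the character vector into a separate document
split.docs <- Corpus(VectorSource(unlist(segs)))
length(split.docs)  # 3 documents
```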