A bit of an odd question: is there a way to split a document imported with the Corpus function into multiple documents that can then be read back into my corpus as separate documents? For example, if I use
inspect(documents[1])
and get something like
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff
I would like to split the document after the phrase "I want to split after this line!!!", which appears twice in this case. Is that possible?
The result of inspect(documents) would then look like this:
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
[[2]]
<<PlainTextDocument (metadata: 7)>>
Hi mom
Purple is my favorite color
I want to split after this line!!!
[[3]]
<<PlainTextDocument (metadata: 7)>>
Words
And stuff
Answer 0 (score: 3)
You can split the document with strsplit and then recreate the corpus:
Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),   ## coerce to character
           "I want to split after this line!!!",
           fixed = TRUE)[[1]]))            ## use fixed = TRUE since the
                                           ## separator contains special characters
To test this, let's first create a reproducible example:
library(tm)  ## Corpus(), VectorSource(), and inspect() come from tm

documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")), collapse = '\n')))
Then apply the solution from above:
split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),   ## coerce to character
           "I want to split after this line!!!",
           fixed = TRUE)[[1]]))
Now inspect the result:
inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
[[2]]
<<PlainTextDocument (metadata: 7)>>
Hi mom
Purple is my favorite color
[[3]]
<<PlainTextDocument (metadata: 7)>>
Words
And stuff
It looks like strsplit dropped the separator :)
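A quick way to convince yourself of that (a minimal sketch, not part of the original answer) is to run strsplit on a small string and note that the delimiter itself never appears in the returned pieces:

strsplit("part one SEP part two", "SEP", fixed = TRUE)[[1]]
## [1] "part one " " part two"   -- the "SEP" delimiter is gone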
Answer 1 (score: 2)
Here is a simpler way, using the quanteda package:
require(quanteda)
## mytext holds the document text as a character vector
segment(mytext, what = "other", delimiter = "I want to split after this line!!!")
This returns a list of length 1 (because it can operate on multiple documents if you have them), but if you just want a vector you can always unlist() it.
[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n \nHi mom\n\nPurple is my favorite color\n\n"
[3] "\n \nWords\n\nAnd stuff"
You can then read this back into a quanteda corpus with corpus(mytextSegmented), or into a tm corpus, for subsequent processing.
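A minimal sketch of that last step (assuming mytextSegmented from above, and flattening the list with unlist() first; tm must be loaded for Corpus/VectorSource):

corpus(unlist(mytextSegmented))                 ## quanteda corpus
Corpus(VectorSource(unlist(mytextSegmented)))   ## tm corpus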