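I am trying to convert a large list of character vectors (2284879 elements, 593.7 Mb) into a data frame. Each list element is a character vector containing four strings; they were created from a list of 4-grams.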
class(words_split)
[1] "list"
length(words_split)
[1] 2284879
head(words_split)
[[1]]
[1] "the" "end" "of" "the"
[[2]]
[1] "the" "rest" "of" "the"
[[3]]
[1] "at" "the" "end" "of"
[[4]]
[1] "to" "be" "abl" "to"
[[5]]
[1] "at" "the" "same" "time"
[[6]]
[1] "in" "the" "middl" "of"
The desired result would be:
[,1] [,2] [,3] [,4]
[1,] "the" "end" "of" "the"
[2,] "the" "rest" "of" "the"
[3,] "at" "the" "end" "of"
[4,] "to" "be" "abl" "to"
After searching and trying various approaches, it seems that do.call and rbind are the solution.
words_table<-as.data.table(do.call(rbind,words_split))
But the result has 12 columns instead of 4:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "the" "end" "of" "the" "the" "end" "of" "the" "the" "end" "of" "the"
[2,] "the" "rest" "of" "the" "the" "rest" "of" "the" "the" "rest" "of" "the"
[3,] "at" "the" "end" "of" "at" "the" "end" "of" "at" "the" "end" "of"
[4,] "to" "be" "abl" "to" "to" "be" "abl" "to" "to" "be" "abl" "to"
[5,] "at" "the" "same" "time" "at" "the" "same" "time" "at" "the" "same" "time"
[6,] "in" "the" "middl" "of" "in" "the" "middl" "of" "in" "the" "middl" "of"
If I take a subset of words_split (for example, the first 4 elements) and perform the same operation, the result is fine:
> words_head<-words_split[1:4]
> words_head
[[1]]
[1] "the" "end" "of" "the"
[[2]]
[1] "the" "rest" "of" "the"
[[3]]
[1] "at" "the" "end" "of"
[[4]]
[1] "to" "be" "abl" "to"
> class(words_head[1])
[1] "list"
> class(words_head[[1]])
[1] "character"
> words_head[[1]]
[1] "the" "end" "of" "the"
> words_head_comb<-do.call(rbind,words_head)
> print(head(words_head_comb))
[,1] [,2] [,3] [,4]
[1,] "the" "end" "of" "the"
[2,] "the" "rest" "of" "the"
[3,] "at" "the" "end" "of"
[4,] "to" "be" "abl" "to"
Why does rbind() repeat each of my list elements two more times when the list is large, while it seems to work fine when the list is small?
Answer 0 (score: 2)
As pointed out in the comments, the result is correct when all elements have length 4. E.g.,
words_split <- list(
c("the", "end", "of", "the"),
c("the", "rest", "of", "the" ),
c("at", "the", "end", "of"),
c("to", "be", "abl", "to"),
c("at", "the", "same", "time"),
c("in", "the", "middl", "of"))
do.call(rbind,words_split)
#R [,1] [,2] [,3] [,4]
#R [1,] "the" "end" "of" "the"
#R [2,] "the" "rest" "of" "the"
#R [3,] "at" "the" "end" "of"
#R [4,] "to" "be" "abl" "to"
#R [5,] "at" "the" "same" "time"
#R [6,] "in" "the" "middl" "of"
However, if one of the elements is longer than four, then the following passage from help("rbind") matters:
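If all the arguments are vectors, the number of columns (rows) in the result is equal to the length of the longest vector. Values in shorter arguments are recycled to achieve this length (with a warning if they are recycled only fractionally).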
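The recycling rule quoted above can be seen on a toy input. This is only a minimal sketch: the vectors and the names m_whole and m_frac are made up for illustration and are not part of the question's data.

```r
# Whole recycling: 4 is a multiple of 2, so the shorter vector is
# repeated silently to fill the row
m_whole <- rbind(c("a", "b", "c", "d"), c("x", "y"))
m_whole[2, ]   # "x" "y" "x" "y"

# Fractional recycling: 4 is not a multiple of 3, so rbind() still
# recycles but also signals a warning. tryCatch() captures the warning
# text here instead of returning the matrix.
m_frac <- tryCatch(
  rbind(c("a", "b", "c", "d"), c("x", "y", "z")),
  warning = function(w) conditionMessage(w)
)
m_frac
```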
Thus, if we add an element of length 12, then we get
words_split[[7]] <- c(
"some", "char", "sequence", "which", "has", "length", "of",
"eights", "which", "is", "too", "long")
do.call(rbind,words_split)
#R [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#R [1,] "the" "end" "of" "the" "the" "end" "of" "the"
#R [2,] "the" "rest" "of" "the" "the" "rest" "of" "the"
#R [3,] "at" "the" "end" "of" "at" "the" "end" "of"
#R [4,] "to" "be" "abl" "to" "to" "be" "abl" "to"
#R [5,] "at" "the" "same" "time" "at" "the" "same" "time"
#R [6,] "in" "the" "middl" "of" "in" "the" "middl" "of"
#R [7,] "some" "char" "sequence" "which" "has" "length" "of" "eights"
#R [,9] [,10] [,11] [,12]
#R [1,] "the" "end" "of" "the"
#R [2,] "the" "rest" "of" "the"
#R [3,] "at" "the" "end" "of"
#R [4,] "to" "be" "abl" "to"
#R [5,] "at" "the" "same" "time"
#R [6,] "in" "the" "middl" "of"
#R [7,] "which" "is" "too" "long"
There is no warning because 12 is a multiple of 4.
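To find out whether this is what happens in the large list, one option is to check the element lengths with lengths() before binding. The sketch below uses a shortened toy version of words_split. The names n, bad and words_mat are illustrative, and dropping the non-4-gram elements is only an assumption about what is wanted; splitting or repairing them may fit your data better.

```r
words_split <- list(
  c("the", "end", "of", "the"),
  c("the", "rest", "of", "the"),
  c("at", "the", "end", "of"),
  c("some", "char", "sequence", "which", "has", "length", "of",
    "eights", "which", "is", "too", "long"))

# lengths() returns the length of every list element
n <- lengths(words_split)
table(n)               # how many elements there are of each length
bad <- which(n != 4)   # positions of the elements that are not 4-grams
bad

# Possible fix (an assumption): keep only the elements of length 4
words_mat <- do.call(rbind, words_split[n == 4])
dim(words_mat)
```

If table(n) on the full list shows any length other than 4, those elements are what makes rbind() widen the result and recycle the shorter rows.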