Question

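I am trying to convert a large list of character vectors (2,284,879 elements, 593.7 Mb) into a data frame. Each list element is a character vector of four strings; they were created from a list of 4-grams.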

class(words_split)
[1] "list"
length(words_split)
[1] 2284879
head(words_split)
[[1]]
[1] "the" "end" "of"  "the"
[[2]]
[1] "the"  "rest" "of"   "the" 
[[3]]
[1] "at"  "the" "end" "of" 
[[4]]
[1] "to"  "be"  "abl" "to" 
[[5]]
[1] "at"   "the"  "same" "time"
[[6]]
[1] "in"    "the"   "middl" "of"

The desired result would be:

    [,1]  [,2]   [,3]  [,4] 
[1,] "the" "end"  "of"  "the"
[2,] "the" "rest" "of"  "the"
[3,] "at"  "the"  "end" "of" 
[4,] "to"  "be"   "abl" "to"

After searching and trying various approaches, do.call and rbind seem to be the solution.

words_table<-as.data.table(do.call(rbind,words_split))

But the result has 12 columns instead of 4:

    [,1]  [,2]   [,3]    [,4]   [,5]  [,6]   [,7]    [,8]   [,9]  [,10]  [,11]   [,12] 
[1,] "the" "end"  "of"    "the"  "the" "end"  "of"    "the"  "the" "end"  "of"    "the" 
[2,] "the" "rest" "of"    "the"  "the" "rest" "of"    "the"  "the" "rest" "of"    "the" 
[3,] "at"  "the"  "end"   "of"   "at"  "the"  "end"   "of"   "at"  "the"  "end"   "of"  
[4,] "to"  "be"   "abl"   "to"   "to"  "be"   "abl"   "to"   "to"  "be"   "abl"   "to"  
[5,] "at"  "the"  "same"  "time" "at"  "the"  "same"  "time" "at"  "the"  "same"  "time"
[6,] "in"  "the"  "middl" "of"   "in"  "the"  "middl" "of"   "in"  "the"  "middl" "of"

If I take a subset of words_split (for example, the first 4 elements) and do the same thing, the result is fine:

> words_head<-words_split[1:4]
> words_head
[[1]]
[1] "the" "end" "of"  "the"

[[2]]
[1] "the"  "rest" "of"   "the" 

[[3]]
[1] "at"  "the" "end" "of" 

[[4]]
[1] "to"  "be"  "abl" "to" 
> class(words_head[1])
[1] "list"
> class(words_head[[1]])
[1] "character"
> words_head[[1]]
[1] "the" "end" "of"  "the"
> words_head_comb<-do.call(rbind,words_head)
print(head(words_head_comb))
     [,1]  [,2]   [,3]  [,4] 
[1,] "the" "end"  "of"  "the"
[2,] "the" "rest" "of"  "the"
[3,] "at"  "the"  "end" "of" 
[4,] "to"  "be"   "abl" "to"

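Why does rbind() repeat each of my list elements two more times across the columns when the list is large, while it seems to work fine when the list is small?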

Answer 1

As pointed out in the comments, the result is correct when all elements have length 4:

words_split <- list(
  c("the", "end", "of", "the"),
  c("the", "rest", "of", "the" ),
  c("at", "the", "end", "of"), 
  c("to", "be", "abl", "to"), 
  c("at", "the", "same", "time"), 
  c("in", "the", "middl", "of"))

do.call(rbind,words_split)
#R       [,1]  [,2]   [,3]    [,4]  
#R [1,] "the" "end"  "of"    "the" 
#R [2,] "the" "rest" "of"    "the" 
#R [3,] "at"  "the"  "end"   "of"  
#R [4,] "to"  "be"   "abl"   "to"  
#R [5,] "at"  "the"  "same"  "time"
#R [6,] "in"  "the"  "middl" "of"

However, if one of the elements is longer than four, the following passage from help("rbind") becomes important:

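If all the arguments are vectors, the number of columns (rows) in the result is equal to the length of the longest vector. Values in shorter arguments are recycled to achieve this length (with a warning if they are recycled only fractionally).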
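
A minimal sketch of that recycling rule, using made-up short vectors rather than the asker's data. The names m and w are only illustrative.

```r
# Lengths 2 and 4: the shorter vector is recycled exactly twice, so no warning.
m <- rbind(c("a", "b"), c("w", "x", "y", "z"))
dim(m)   # 2 rows, 4 columns
m[1, ]   # "a" "b" "a" "b"

# Lengths 3 and 4: 4 is not a multiple of 3, so rbind() also raises a warning.
# tryCatch() captures the warning text instead of printing it.
w <- tryCatch(
  rbind(c("a", "b", "c"), c("w", "x", "y", "z")),
  warning = function(cond) conditionMessage(cond)
)
```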

So if we add an element of length 12, we get

words_split[[7]] <- c(
  "some", "char", "sequence", "which", "has", "length", "of", 
  "eights", "which", "is", "too", "long")
do.call(rbind,words_split)
#R      [,1]   [,2]   [,3]       [,4]    [,5]  [,6]     [,7]    [,8]    
#R [1,] "the"  "end"  "of"       "the"   "the" "end"    "of"    "the"   
#R [2,] "the"  "rest" "of"       "the"   "the" "rest"   "of"    "the"   
#R [3,] "at"   "the"  "end"      "of"    "at"  "the"    "end"   "of"    
#R [4,] "to"   "be"   "abl"      "to"    "to"  "be"     "abl"   "to"    
#R [5,] "at"   "the"  "same"     "time"  "at"  "the"    "same"  "time"  
#R [6,] "in"   "the"  "middl"    "of"    "in"  "the"    "middl" "of"    
#R [7,] "some" "char" "sequence" "which" "has" "length" "of"    "eights"
#R      [,9]    [,10]  [,11]   [,12] 
#R [1,] "the"   "end"  "of"    "the" 
#R [2,] "the"   "rest" "of"    "the" 
#R [3,] "at"    "the"  "end"   "of"  
#R [4,] "to"    "be"   "abl"   "to"  
#R [5,] "at"    "the"  "same"  "time"
#R [6,] "in"    "the"  "middl" "of"  
#R [7,] "which" "is"   "too"   "long"

There is no warning because 12 is a multiple of 4.
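
One way to check for this in a list like the asker's is to inspect the element lengths before binding. The sketch below uses a small toy list as a stand-in for the real words_split. Dropping the elements that are not of length 4 is only an assumption about what is wanted; the thread does not say how such elements should be handled.

```r
# Toy stand-in for the real list: three 4-grams plus one stray 12-element vector.
words_split <- list(
  c("the", "end", "of", "the"),
  c("the", "rest", "of", "the"),
  c("at", "the", "end", "of"),
  c("some", "char", "sequence", "which", "has", "length", "of",
    "eights", "which", "is", "too", "long")
)

n <- lengths(words_split)   # length of every list element
table(n)                    # how many elements there are of each length
bad <- which(n != 4)        # positions of the elements that are not 4-grams

# Assumed policy: keep only the elements of length 4, then bind as before.
words_ok  <- words_split[n == 4]
words_mat <- do.call(rbind, words_ok)
dim(words_mat)              # 3 rows, 4 columns
```

Inspecting words_split[bad] on the real data should show which elements cause the extra columns.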

rbind on a large list of vectors creates unwanted extra columns
