rbind on a large list of vectors creates unnecessary extra columns

Asked: 2018-08-11 18:59:55

Tags: r rbind do.call

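I am trying to convert a large list of character vectors (2,284,879 elements, 593.7 Mb) into a data frame. Each list element is a character vector of four strings; the vectors were created from a list of 4-grams.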

class(words_split)
[1] "list"
length(words_split)
[1] 2284879
head(words_split)
[[1]]
[1] "the" "end" "of"  "the"
[[2]]
[1] "the"  "rest" "of"   "the" 
[[3]]
[1] "at"  "the" "end" "of" 
[[4]]
[1] "to"  "be"  "abl" "to" 
[[5]]
[1] "at"   "the"  "same" "time"
[[6]]
[1] "in"    "the"   "middl" "of" 

The desired result would be:

    [,1]  [,2]   [,3]  [,4] 
[1,] "the" "end"  "of"  "the"
[2,] "the" "rest" "of"  "the"
[3,] "at"  "the"  "end" "of" 
[4,] "to"  "be"   "abl" "to" 

After searching and trying various approaches, it seems that do.call with rbind is the solution.

words_table<-as.data.table(do.call(rbind,words_split))

But the result has 12 columns instead of 4:

    [,1]  [,2]   [,3]    [,4]   [,5]  [,6]   [,7]    [,8]   [,9]  [,10]  [,11]   [,12] 
[1,] "the" "end"  "of"    "the"  "the" "end"  "of"    "the"  "the" "end"  "of"    "the" 
[2,] "the" "rest" "of"    "the"  "the" "rest" "of"    "the"  "the" "rest" "of"    "the" 
[3,] "at"  "the"  "end"   "of"   "at"  "the"  "end"   "of"   "at"  "the"  "end"   "of"  
[4,] "to"  "be"   "abl"   "to"   "to"  "be"   "abl"   "to"   "to"  "be"   "abl"   "to"  
[5,] "at"  "the"  "same"  "time" "at"  "the"  "same"  "time" "at"  "the"  "same"  "time"
[6,] "in"  "the"  "middl" "of"   "in"  "the"  "middl" "of"   "in"  "the"  "middl" "of"  

If I take a subset of words_split (for example, the first 4 elements) and perform the same operation, the result is fine:

> words_head<-words_split[1:4]
> words_head
[[1]]
[1] "the" "end" "of"  "the"

[[2]]
[1] "the"  "rest" "of"   "the" 

[[3]]
[1] "at"  "the" "end" "of" 

[[4]]
[1] "to"  "be"  "abl" "to" 
> class(words_head[1])
[1] "list"
> class(words_head[[1]])
[1] "character"
> words_head[[1]]
[1] "the" "end" "of"  "the"
> words_head_comb<-do.call(rbind,words_head)
> print(head(words_head_comb))
     [,1]  [,2]   [,3]  [,4] 
[1,] "the" "end"  "of"  "the"
[2,] "the" "rest" "of"  "the"
[3,] "at"  "the"  "end" "of" 
[4,] "to"  "be"   "abl" "to" 

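Why does rbind() repeat the contents of each list element two more times when the list is large, while it seems to work correctly when the list is small?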

1 answer:

Answer 0 (score: 2)

As pointed out in the comments, the result is correct when every element has length 4. For example:

words_split <- list(
  c("the", "end", "of", "the"),
  c("the", "rest", "of", "the" ),
  c("at", "the", "end", "of"), 
  c("to", "be", "abl", "to"), 
  c("at", "the", "same", "time"), 
  c("in", "the", "middl", "of"))

do.call(rbind,words_split)
#R       [,1]  [,2]   [,3]    [,4]  
#R [1,] "the" "end"  "of"    "the" 
#R [2,] "the" "rest" "of"    "the" 
#R [3,] "at"  "the"  "end"   "of"  
#R [4,] "to"  "be"   "abl"   "to"  
#R [5,] "at"  "the"  "same"  "time"
#R [6,] "in"  "the"  "middl" "of"  

However, if one of the elements is longer than four, then the following sentence from help("rbind") becomes important:

  

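"If all the arguments are vectors, the number of columns (rows) in the result is equal to the length of the longest vector. Values in shorter arguments are recycled to achieve this length (with a warning if they are recycled only fractionally)."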
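The recycling rule can be seen on toy vectors. The following is a minimal sketch, not part of the original answer; the names `short`, `even`, `odd`, and the helper `warns` are made up for illustration. It compares a case where the longest length is a whole multiple of the shorter one with a case where it is not.

```r
short <- c("a", "b")
even  <- c("w", "x", "y", "z")  # length 4 is a multiple of 2
odd   <- c("p", "q", "r")       # length 3 is not a multiple of 2

# Hypothetical helper: TRUE if evaluating `expr` raises a warning
warns <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

# "a" "b" is silently repeated to fill all 4 columns
m_even <- rbind(short, even)
print(m_even)

warned_even <- warns(rbind(short, even))  # whole-number recycling
warned_odd  <- warns(rbind(short, odd))   # fractional recycling
warned_even  # FALSE
warned_odd   # TRUE
```

The silent case is the one that matters here: rbind gives no hint that the short rows were padded with repeats.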

So if we add an element of length 12, we get

words_split[[7]] <- c(
  "some", "char", "sequence", "which", "has", "length", "of", 
  "eights", "which", "is", "too", "long")
do.call(rbind,words_split)
#R      [,1]   [,2]   [,3]       [,4]    [,5]  [,6]     [,7]    [,8]    
#R [1,] "the"  "end"  "of"       "the"   "the" "end"    "of"    "the"   
#R [2,] "the"  "rest" "of"       "the"   "the" "rest"   "of"    "the"   
#R [3,] "at"   "the"  "end"      "of"    "at"  "the"    "end"   "of"    
#R [4,] "to"   "be"   "abl"      "to"    "to"  "be"     "abl"   "to"    
#R [5,] "at"   "the"  "same"     "time"  "at"  "the"    "same"  "time"  
#R [6,] "in"   "the"  "middl"    "of"    "in"  "the"    "middl" "of"    
#R [7,] "some" "char" "sequence" "which" "has" "length" "of"    "eights"
#R      [,9]    [,10]  [,11]   [,12] 
#R [1,] "the"   "end"  "of"    "the" 
#R [2,] "the"   "rest" "of"    "the" 
#R [3,] "at"    "the"  "end"   "of"  
#R [4,] "to"    "be"   "abl"   "to"  
#R [5,] "at"    "the"  "same"  "time"
#R [6,] "in"    "the"  "middl" "of"  
#R [7,] "which" "is"   "too"   "long"

There is no warning because 12 is a multiple of 4.
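To check whether this is what happened in a large list, one option is to tabulate the element lengths with lengths() and look for anything other than 4. The sketch below is an illustration rather than part of the original answer: it uses a small stand-in list named `demo_split`, and the names `n`, `bad`, `words_ok`, and `words_mat` are assumptions. It locates the unexpected elements and then binds only the well-formed 4-grams.

```r
# Small stand-in for the real list: three 4-grams plus one over-long element
demo_split <- list(
  c("the", "end", "of", "the"),
  c("the", "rest", "of", "the"),
  c("at", "the", "end", "of"),
  c("some", "char", "sequence", "which", "has", "length", "of",
    "eights", "which", "is", "too", "long")
)

n <- lengths(demo_split)   # length of every list element
print(table(n))            # how many elements have each length

bad <- which(n != 4)       # positions of elements that are not 4-grams
print(bad)

# Keep only the elements of the expected length, then bind them row-wise
words_ok  <- demo_split[n == 4]
words_mat <- do.call(rbind, words_ok)
print(dim(words_mat))
```

Whether the odd elements should be dropped, truncated, or re-split depends on how the 4-grams were produced, so inspect `demo_split[bad]` (or the equivalent on the real list) before filtering.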