Split strings and convert to a data frame

Time: 2019-06-12 08:24:20

Tags: r

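I have a character vector of 65k elements in the format shown below. The elements differ in length: split on commas, each one has between 3 and 8 fields.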

b[1]= "aaaa, bbbb, cccc"
...
b[1000]="aaaa, bbbb, cccc, dddd, eeee, ffff"
...
b[3000]="aaaa, bbbb, cccc, dddd, eeee, ffff, gggg"
b[3001]="aaaa, bbbb, cccc"

I want to convert it to a data frame:

row  col1 col2 col3 col4 col5 col6 col7
1    aaaa bbbb cccc
1000 aaaa bbbb cccc dddd eeee ffff
3000 aaaa bbbb cccc dddd eeee ffff gggg

I tried:

 data.frame( do.call( rbind, strsplit( b, ',' ) ) ) 

and got:

  

Warning message:
In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 1)

Any suggestions?

2 answers:

Answer 0: (score: 4)

We can use read.csv after pasting the strings together into a single string, collapsed with "\n":

read.csv(text = paste0(b, collapse = "\n"), header = FALSE)

#    V1    V2    V3    V4    V5    V6    V7
#1 aaaa  bbbb  cccc                        
#2 aaaa  bbbb  cccc  dddd  eeee  ffff      
#3 aaaa  bbbb  cccc  dddd  eeee  ffff  gggg

If you want the empty strings to be read as NA, specify them in na.strings:

read.csv(text = paste0(b, collapse = "\n"), header = FALSE, na.strings = "")
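One caveat worth checking on the full 65k-element vector: read.table (which read.csv calls) determines the number of columns from the first five lines of input unless a longer col.names is supplied. If the first few elements have only three fields, longer rows further down may be wrapped onto extra rows. The following is a minimal sketch, assuming the same b as in the Data section; the n_col variable and the V1..V7 names are illustrative choices, not part of the original answer. It also sets strip.white = TRUE so the space after each comma is dropped.

```r
b <- c("aaaa, bbbb, cccc",
       "aaaa, bbbb, cccc, dddd, eeee, ffff",
       "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")

# Widest element, counted directly from the strings (illustrative helper variable)
n_col <- max(lengths(strsplit(b, ",", fixed = TRUE)))

df <- read.csv(text = paste(b, collapse = "\n"),
               header = FALSE,
               col.names = paste0("V", seq_len(n_col)),  # fix the column count up front
               strip.white = TRUE,                       # drop the space after each comma
               na.strings = "",                          # empty fields become NA
               stringsAsFactors = FALSE)
```

Supplying col.names means the result no longer depends on which rows happen to come first in b.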

Another option is stri_list2matrix from stringi:

data.frame(stringi::stri_list2matrix(strsplit(b, ","), byrow = TRUE))

#   X1    X2    X3    X4    X5    X6    X7
#1 aaaa  bbbb  cccc  <NA>  <NA>  <NA>  <NA>
#2 aaaa  bbbb  cccc  dddd  eeee  ffff  <NA>
#3 aaaa  bbbb  cccc  dddd  eeee  ffff  gggg
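If adding a package is not an option, the same padding can be approximated in base R. The warning in the question comes from rbind recycling the shorter vectors to fill each row; extending every piece to a common length with NA first avoids that. This is only a sketch under the assumption that fields are separated by a comma plus optional spaces; the names pieces, padded, and col1..col7 are illustrative.

```r
b <- c("aaaa, bbbb, cccc",
       "aaaa, bbbb, cccc, dddd, eeee, ffff",
       "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")

# Split on a comma plus any following whitespace
pieces <- strsplit(b, ",\\s*")

# Pad every piece with NA up to the length of the longest one
n_col  <- max(lengths(pieces))
padded <- lapply(pieces, `length<-`, n_col)

# All rows now have equal length, so rbind no longer recycles values
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df) <- paste0("col", seq_len(n_col))
```

Here missing trailing fields come out as NA rather than as empty strings.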

Data

b <- c("aaaa, bbbb, cccc", "aaaa, bbbb, cccc, dddd, eeee, ffff", 
       "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")

Answer 1: (score: 1)

We can use fread from data.table:

library(data.table)
fread(paste(b, collapse = "\n"), fill = TRUE)
#   V1   V2   V3   V4   V5   V6   V7
#1: aaaa bbbb cccc                    
#2: aaaa bbbb cccc dddd eeee ffff     
#3: aaaa bbbb cccc dddd eeee ffff gggg

Data

b <- c("aaaa, bbbb, cccc", "aaaa, bbbb, cccc, dddd, eeee, ffff", 
   "aaaa, bbbb, cccc, dddd, eeee, ffff, gggg")