Splitting the elements of a vector into separate columns

Time: 2018-12-03 14:17:31

Tags: r data.table fread

Suppose I have a large TSV file with more than 20 million rows that looks like this:

    a b {"condition1":["ABC"], "condition3":false, "condition4":4000}
    c c {"condition1":["BBB"],"condition2":true}

I need it to look like this:

     Var1 Var2 Condition1 Condition2 Condition3 Condition4
     a    b    ABC        NA         FALSE      4000
     c    c    BBB        TRUE       NA         NA

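I tried the code below, but it is (a) inefficient and (b) does not work.

Is there a ready-made solution for splitting the third column while the file is being read?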

     dt <- fread(input = ifilename, header = TRUE, encoding = "UTF-8")
     output <- dt[, c("filter")]  # assume the third column is named "filter"
     fwrite(x = output, file = "./DB/filter.csv")
     filter.db <- fread(input = "./DB/filter.csv", fill = TRUE)

2 answers:

Answer 0: (score: 1)

A possible solution:

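library(data.table)
library(jsonlite)

# Parse each JSON string into a one-row data.table and stack the rows;
# fill = TRUE inserts NA for keys that a row does not contain.
to_add <- rbindlist(lapply(dt$V3, function(x) setDT(fromJSON(x))), fill = TRUE)
setcolorder(to_add, sort(names(to_add)))

# Add the parsed columns by reference, then drop the raw JSON column.
dt[, names(to_add) := to_add][, V3 := NULL][]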

which gives:

   V1 V2 condition1 condition2 condition3 condition4
1:  a  b        ABC         NA      FALSE       4000
2:  c  c        BBB       TRUE         NA         NA

Data used:

dt <- structure(list(V1 = c("a", "c"),
                     V2 = c("b", "c"),
                     V3 = c("{\"condition1\":[\"ABC\"], \"condition3\":false, \"condition4\":4000}",
                            "{\"condition1\":[\"BBB\"],\"condition2\":true}")),
                .Names = c("V1", "V2", "V3"), row.names = c(NA, -2L), class = c("data.table", "data.frame"))
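
The approach in this answer starts from a data.table that is already in memory. As a minimal sketch of the full path from tab-separated text to the final table, the block below reads two sample lines with fread and then applies the same per-row parsing. The `tsv_lines` vector is a hypothetical stand-in for the real file, and it assumes the file has no header row and that every `condition1` array holds exactly one element.

```r
library(data.table)
library(jsonlite)

# Hypothetical two-line sample standing in for the real 20-million-row file.
tsv_lines <- c(
  'a\tb\t{"condition1":["ABC"], "condition3":false, "condition4":4000}',
  'c\tc\t{"condition1":["BBB"],"condition2":true}'
)

# sep = "\t" splits on tabs only; quote = "" leaves the double quotes
# inside the JSON text untouched.
dt <- fread(text = tsv_lines, sep = "\t", header = FALSE, quote = "")

# Parse each JSON string into a named list, then stack the lists;
# fill = TRUE inserts NA where a key is absent from a row.
parsed <- rbindlist(lapply(dt$V3, fromJSON), fill = TRUE)
setcolorder(parsed, sort(names(parsed)))

# Put the two plain columns and the parsed columns side by side.
result <- cbind(dt[, .(V1, V2)], parsed)
print(result)
```

For a real file, `text = tsv_lines` would be replaced by the file path. The per-row `lapply` is the likely bottleneck at 20 million rows, so this is a correctness sketch rather than a tuned solution.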

Answer 1: (score: 0)

*nix tools may be faster in this case, because the JSON parsers in R were somewhat slow in my tests.

> library(data.table)
> aTbl = fread(cmd="cat foo.txt | grep -P -o '^\\w+\\s+\\w+'", header=F)
> aTbl
   V1 V2
1:  a  b
2:  c  c

> bTbl = fread(cmd="cat foo.txt | grep -P -o '[{].*$' | jq -r '[ .condition1[], .condition2, .condition3, .condition4 ] | @csv'", header=F)
> bTbl
    V1   V2    V3   V4
1: ABC   NA FALSE 4000
2: BBB TRUE    NA   NA

> setnames(aTbl, c('Var1', 'Var2'))
> setnames(bTbl, c('Condition1', 'Condition2', 'Condition3', 'Condition4'))

> cTbl = cbind(aTbl, bTbl)
> cTbl
   Var1 Var2 Condition1 Condition2 Condition3 Condition4
1:    a    b        ABC         NA      FALSE       4000
2:    c    c        BBB       TRUE         NA         NA
>
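
If jq is not available, part of the R-side parsing cost noted in this answer may be avoidable by calling the parser once on all rows instead of once per row. The block below is a rough sketch of that idea, not a benchmarked replacement: `json_col` is a hypothetical stand-in for the third column, and it assumes each `condition1` array contains exactly one element, as in the sample data.

```r
library(data.table)
library(jsonlite)

# Hypothetical sample of the JSON column (third column of the TSV).
json_col <- c(
  '{"condition1":["ABC"], "condition3":false, "condition4":4000}',
  '{"condition1":["BBB"],"condition2":true}'
)

# Join the rows into one JSON array so the parser runs a single time.
parsed <- fromJSON(paste0("[", paste(json_col, collapse = ","), "]"))

# condition1 holds arrays; keep only the first element of each
# (assumption: every array has exactly one element).
parsed$condition1 <- vapply(parsed$condition1,
                            function(v) as.character(v[1]),
                            character(1), USE.NAMES = FALSE)

bTbl <- as.data.table(parsed)
setcolorder(bTbl, sort(names(bTbl)))
print(bTbl)
```

Building one very large string for 20 million rows can use a lot of memory, so in practice the column would probably need to be processed in chunks and the pieces combined with `rbindlist(..., fill = TRUE)`.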