Splitting an irregular column in R

Date: 2016-09-06 22:49:53

Tags: r split

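My data frame has a column that I need to split into multiple columns on the delimiter "_". However, for each row I only need to keep the last two columns of the output (these always contain the data I need). The number of delimiters varies across records, so splitting produces a different number of columns per row. How can I get just the last two columns for each observation? Here are some example records: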

unique(data$tagid.1)
[1] tag id                                    00000_0_0900_226000013189                
[3] 00000_0_0986_114100005288                 00000_0_0900_226000132078                
[5] 00000_0_09LA_00000_0_0900_226000          00000_0_0900_226000131998                
[7] 0000_2004000000000847                     00000_0_0900_22600001\a\0048\022LI       
[9] 00000_0_0900_226000013189I                00000_0_0986_114100006473   

I am trying to get output like this:

tagid$C1         tagid$C2
0986             114100005288    
0900             226000013189   
0900             226000
etc....          etc....

My solution has two problems: it outputs two rows and 57k columns, and it is slow. Does anyone have a better solution than the following?

> library(splitstackshape)  # provides cSplit()
> data.tag <- as.data.frame(data$tagid.1)
> tag1 <- cSplit(data.tag, "data$tagid.1", sep = "_")
>
>   head(tag1)
   data$tagid.1_1 data$tagid.1_2 data$tagid.1_3 data$tagid.1_4 data$tagid.1_5          data$tagid.1_6 data$tagid.1_7
1:         tag id             NA             NA             NA             NA                 NA             NA
2:          00000              0           0900   226000013189             NA             NA             NA
3:          00000              0           0900   226000013189             NA             NA             NA
4:          00000              0           0900   226000013189             NA             NA             NA
5:          00000              0           0900   226000013189             NA             NA             NA
6:          00000              0           0900   226000013189             NA             NA             NA
>   
>   lastValue <- function(x)   tail(x[!is.na(x)], 2)
>   tag2 <- as.data.frame(apply(tag1, 1, lastValue)) 
> dim(tag2)
[1]     2 56997
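
The 2 x 56997 shape is consistent with how apply() works over rows: when every call returns a vector of the same length, each result becomes a column of the returned matrix. A minimal sketch of that behavior, using a small made-up matrix in place of the cSplit() output and assuming every row has at least two non-NA fields:

```r
# Toy matrix standing in for the split columns (hypothetical data,
# padded with NA the way the cSplit() output above is).
m <- rbind(c("00000", "0", "0900", "226000013189", NA, NA, NA),
           c("00000", "0", "0986", "114100005288", NA, NA, NA),
           c("00000", "0", "09LA", "00000", "0", "0900", "226000"))

lastValue <- function(x) tail(x[!is.na(x)], 2)

# One column per input row, so the result is 2 x nrow(m).
res <- apply(m, 1, lastValue)
dim(res)

# Transposing gives one row per observation.
tag2 <- as.data.frame(t(res), stringsAsFactors = FALSE)
names(tag2) <- c("C1", "C2")
tag2
```

This only addresses the orientation of the result, not the speed of splitting into many columns first.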

2 Answers:

Answer 0 (score: 4)

This can be done with a regular expression:

pat <- "^.*_(.*)_(.*)$"
data.tag <- data.frame(tagid.1 = c("tag id",
                         "00000_0_0900_226000013189",
                         "00000_0_0986_114100005288",
                         "00000_0_0900_226000132078",
                          "00000_0_0900_22600001\a\0048\022LI"))
data.frame(C1 = sub(pat, "\\1", data.tag[,1]),
           C2 = sub(pat, "\\2", data.tag[,1]))


      C1                    C2
1 tag id                tag id
2   0900          226000013189
3   0986          114100005288
4   0900          226000132078
5   0900 22600001\a\0048\022LI
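
The pattern above needs at least two underscores to match, so a value such as 0000_2004000000000847 from the question would come back unchanged in both columns. A possible variant, offered as a sketch rather than as part of the original answer, makes the leading part optional; pat2 and the sample vector x are names introduced here for illustration:

```r
# Optional prefix ending in "_", then the last two underscore-free fields.
# perl = TRUE is used so the non-capturing group (?: ) is supported.
pat2 <- "^(?:.*_)?([^_]*)_([^_]*)$"

x <- c("00000_0_0900_226000013189",
       "0000_2004000000000847",
       "00000_0_09LA_00000_0_0900_226000")

out <- data.frame(C1 = sub(pat2, "\\1", x, perl = TRUE),
                  C2 = sub(pat2, "\\2", x, perl = TRUE),
                  stringsAsFactors = FALSE)
out
```

Values with no underscore at all (such as "tag id") still do not match and are returned as-is, the same as with the original pattern.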

Answer 1 (score: 1)

We can also do this with strsplit:
setNames(do.call(rbind.data.frame, lapply(strsplit(as.character(data.tag[,1]), "_"), 
          function(x) if(length(x)==1) rep(x, 2) else tail(x,2))), paste0("C", 1:2))
#     C1                    C2
#1 tag id                tag id
#2   0900          226000013189
#3   0986          114100005288
#4   0900          226000132078
#5   0900 22600001\a\0048\022LI
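
For a column with tens of thousands of rows, binding one small data frame per element may be slow. A sketch of the same idea using vapply(), which collects the results into a single character matrix before the data frame is built; the sample vector x is made up here, and empty strings are assumed not to occur:

```r
x <- c("tag id",
       "00000_0_0900_226000013189",
       "00000_0_09LA_00000_0_0900_226000")

# fixed = TRUE treats "_" as a literal string instead of a regex.
parts <- strsplit(x, "_", fixed = TRUE)

# Keep the last two pieces; repeat a lone piece, as in the answer above.
last_two <- vapply(parts,
                   function(p) if (length(p) < 2) rep(p, 2) else tail(p, 2),
                   character(2))

# last_two is 2 x length(x): row 1 is C1, row 2 is C2.
out <- data.frame(C1 = last_two[1, ], C2 = last_two[2, ],
                  stringsAsFactors = FALSE)
out
```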