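I have a column in my data frame that I need to split into multiple columns on the delimiter "_". However, for each row I only need to keep the last two columns of the output (they always contain the data I need). The number of delimiters varies between records, so splitting produces a different number of columns for different rows. How can I get just the last two columns for each observation? Here are some example records: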
unique(data$tagid.1)
[1] tag id 00000_0_0900_226000013189
[3] 00000_0_0986_114100005288 00000_0_0900_226000132078
[5] 00000_0_09LA_00000_0_0900_226000 00000_0_0900_226000131998
[7] 0000_2004000000000847 00000_0_0900_22600001\a\0048\022LI
[9] 00000_0_0900_226000013189I 00000_0_0986_114100006473
I am trying to get output like this:
tagid$C1 tagid$C2
0986 114100005288
0900 226000013189
0900 226000
etc.... etc....
My solution has some problems: it outputs two rows and 57k columns, and it is slow. Does anyone have a better solution than the following?
> data.tag <- as.data.frame(data$tagid.1)
> tag1 <- cSplit(data.tag,"data$tagid.1",sep="_")
>
> head(tag1)
data$tagid.1_1 data$tagid.1_2 data$tagid.1_3 data$tagid.1_4 data$tagid.1_5 data$tagid.1_6 data$tagid.1_7
1: tag id NA NA NA NA NA NA
2: 00000 0 0900 226000013189 NA NA NA
3: 00000 0 0900 226000013189 NA NA NA
4: 00000 0 0900 226000013189 NA NA NA
5: 00000 0 0900 226000013189 NA NA NA
6: 00000 0 0900 226000013189 NA NA NA
>
> lastValue <- function(x) tail(x[!is.na(x)], 2)
> tag2 <- as.data.frame(apply(tag1, 1, lastValue))
> dim(tag2)
[1] 2 56997
Answer 0 (score: 4)
This can be done with a regular expression:
pat <- "^.*_(.*)_(.*)$"
data.tag <- data.frame(tagid.1 = c("tag id",
"00000_0_0900_226000013189",
"00000_0_0986_114100005288",
"00000_0_0900_226000132078",
"00000_0_0900_22600001\a\0048\022LI"))
data.frame(C1 = sub(pat, "\\1", data.tag[,1]),
C2 = sub(pat, "\\2", data.tag[,1]))
C1 C2
1 tag id tag id
2 0900 226000013189
3 0986 114100005288
4 0900 226000132078
5 0900 22600001\a\0048\022LI
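Note that the pattern above needs at least two underscores to match, so a value with no match (such as "tag id") is returned unchanged in both columns. If NA is preferable for such rows, and values with a single underscore (such as "0000_2004000000000847" from the question) should still be split, a variant along these lines may help. This is only a sketch: the helper name last_two_fields and the modified pattern are assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): extract the last two
# "_"-separated fields; rows with no underscore at all become NA.
last_two_fields <- function(x) {
  x <- as.character(x)
  # Optional prefix ending in "_", then exactly two underscore-free fields.
  pat <- "^(?:.*_)?([^_]*)_([^_]*)$"
  ok <- grepl(pat, x, perl = TRUE)
  data.frame(
    C1 = ifelse(ok, sub(pat, "\\1", x, perl = TRUE), NA_character_),
    C2 = ifelse(ok, sub(pat, "\\2", x, perl = TRUE), NA_character_),
    stringsAsFactors = FALSE
  )
}

# Usage on a few values taken from the question
last_two_fields(c("tag id",
                  "00000_0_0900_226000013189",
                  "0000_2004000000000847"))
```

Because sub() and grepl() are vectorized, this makes a fixed number of passes over the column rather than looping row by row.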
Answer 1 (score: 1)
We can also use strsplit:
setNames(do.call(rbind.data.frame, lapply(strsplit(as.character(data.tag[,1]), "_"),
function(x) if(length(x)==1) rep(x, 2) else tail(x,2))), paste0("C", 1:2))
# C1 C2
#1 tag id tag id
#2 0900 226000013189
#3 0986 114100005288
#4 0900 226000132078
#5 0900 22600001\a\0048\022LI
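With around 57k rows, binding one small piece per row through rbind.data.frame can be slow. A possible alternative, sketched below, collects the pieces with vapply() and transposes once. As with apply() in the question, vapply() returns one column per input element, which is why the question's result had dimensions 2 x 56997; t() turns it into one row per record. The function name last_two_split and the choice of NA for values without an underscore are assumptions made for this illustration.

```r
# Hypothetical helper (not from the original answer): strsplit-based
# extraction of the last two fields, built as a matrix and transposed once.
last_two_split <- function(x) {
  parts <- strsplit(as.character(x), "_", fixed = TRUE)
  out <- vapply(parts, function(p) {
    # Fewer than two fields: return NA for both columns (an assumption;
    # the answer above repeats the single value instead).
    if (length(p) < 2) c(NA_character_, NA_character_) else utils::tail(p, 2)
  }, character(2))
  out <- t(out)                       # 2 x n  ->  n x 2
  colnames(out) <- c("C1", "C2")
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Usage on a few values taken from the question
last_two_split(c("tag id",
                 "00000_0_0986_114100005288",
                 "00000_0_09LA_00000_0_0900_226000"))
```

Using fixed = TRUE treats "_" as a literal string rather than a regular expression, which is usually a little faster for a plain delimiter.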