My data looks like this:
dft<- structure(list(ATM1 = c(0.61048, 0.46609, 0.52073, 0.78661, 0.46614,
0.60211, NA), ATM2 = c(NA, 0.874645, NA, 0.94743, NA, 0.984454,
NA), ATM3 = c(NA, NA, NA, 0.343564, 0.163544, 0.765422, NA)), .Names = c("ATM1",
"ATM2", "ATM3"), row.names = c("A0AV96", "A0FGR8", "2A3N6;O14986;O14617",
"A1L020", "P54792;O14640", "CON__P15497", "Q9H3Y6;CON__H-INV:HIT000016045"
), class = "data.frame")
The row names look like this:
A0AV96
A0FGR8
2A3N6;O14986;O14617
A1L020
P54792;O14640
CON__P15497
Q9H3Y6;CON__H-INV:HIT000016045
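I want to remove any part of a row name that is either CON__ or CON__H-INV:HIT000016045.
Then I want to split the names that are separated by ; into new rows that carry the same values. For example, the output for the data above should look like this: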
ATM1 ATM2 ATM3
A0AV96 0.61048 NA NA
A0FGR8 0.46609 0.874645 NA
2A3N6 0.52073 NA NA
O14986 0.52073 NA NA
O14617 0.52073 NA NA
A1L020 0.78661 0.947430 0.343564
P54792 0.46614 NA 0.163544
O14640 0.46614 NA 0.163544
P15497 0.60211 0.984454 0.765422
Q9H3Y6 NA NA NA
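As an example, the third row holds three names separated by ; (2A3N6;O14986;O14617). They should produce two additional rows that carry the same values as the original row.
This is what the output looks like: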
> temp <- strsplit(gsub("(CON__|CON__H-INV:HIT000016045)", "", rownames(dft)),";")
> # use length of list to "grow" dataframe
> dftNew <- dft[rep(seq_along(temp), sapply(temp, length)), ]
> temp <- unlist(temp)
> temp[duplicated(temp)] <- paste(temp[duplicated(temp)],
+ seq_along(temp[duplicated(temp)]), sep=".")
>
> rownames(dftNew) <- unlist(temp)
> dftNew$id <- rep(seq_along(temp), sapply(temp, length))
> dftNew
ATM1 ATM2 ATM3 id
A0AV96 0.61048 NA NA 1
A0FGR8 0.46609 0.874645 NA 2
2A3N6 0.52073 NA NA 3
O14986 0.52073 NA NA 4
O14617 0.52073 NA NA 5
A1L020 0.78661 0.947430 0.343564 6
P54792 0.46614 NA 0.163544 7
O14640 0.46614 NA 0.163544 8
P15497 0.60211 0.984454 0.765422 9
Q9H3Y6 NA NA NA 10
Answer 0 (score: 2)
This base R code will work:
# get list of rownames, with CON_ stuff dropped and split on ";"
temp <- strsplit(gsub("(CON__|CON__H-INV:HIT000016045)", "", rownames(dft)),";")
# use length of list to "grow" dataframe
dftNew <- dft[rep(seq_along(temp), sapply(temp, length)), ]
# apply new row names
rownames(dftNew) <- unlist(temp)
dftNew
ATM1 ATM2 ATM3
A0AV96 0.61048 NA NA
A0FGR8 0.46609 0.874645 NA
2A3N6 0.52073 NA NA
O14986 0.52073 NA NA
O14617 0.52073 NA NA
A1L020 0.78661 0.947430 0.343564
P54792 0.46614 NA 0.163544
O14640 0.46614 NA 0.163544
P15497 0.60211 0.984454 0.765422
Q9H3Y6 NA NA NA
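The steps above can also be wrapped in a small function. This is only a sketch: `expand_rows` and its arguments `sep` and `drop` are hypothetical names, not part of the original answer, and it assumes the identifiers are unique once split. The longer pattern is listed first so that the result does not depend on how the regex engine chooses between alternatives.

```r
# Sample data taken from the question.
dft <- data.frame(
  ATM1 = c(0.61048, 0.46609, 0.52073, 0.78661, 0.46614, 0.60211, NA),
  ATM2 = c(NA, 0.874645, NA, 0.94743, NA, 0.984454, NA),
  ATM3 = c(NA, NA, NA, 0.343564, 0.163544, 0.765422, NA),
  row.names = c("A0AV96", "A0FGR8", "2A3N6;O14986;O14617", "A1L020",
                "P54792;O14640", "CON__P15497",
                "Q9H3Y6;CON__H-INV:HIT000016045")
)

# Hypothetical helper: one output row per identifier in each row name.
expand_rows <- function(df, sep = ";",
                        drop = "CON__H-INV:HIT000016045|CON__") {
  # Strip the unwanted tokens (longer alternative first).
  cleaned <- gsub(drop, "", rownames(df))
  # Split each row name into its individual identifiers.
  parts <- strsplit(cleaned, sep, fixed = TRUE)
  # Discard empty pieces left behind where a token was removed.
  parts <- lapply(parts, function(p) p[nzchar(p)])
  # Repeat each original row once per identifier it contained;
  # a row whose name becomes empty is dropped entirely.
  out <- df[rep(seq_along(parts), lengths(parts)), , drop = FALSE]
  # Assumes the identifiers are unique; duplicates are rejected here.
  rownames(out) <- unlist(parts)
  out
}

res <- expand_rows(dft)
```

With the sample data, `res` has ten rows, and the three rows that came from 2A3N6;O14986;O14617 share the same values.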
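Comment 1
If the final row names contain duplicates, you will get a warning message. The data.frame will still work, but you will not be able to print it to the screen. The simplest fix with this approach is to add a numeric suffix to the duplicates, as follows: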
# apply new row names with dupes
temp <- unlist(temp)
temp[duplicated(temp)] <- paste(temp[duplicated(temp)],
seq_along(temp[duplicated(temp)]), sep=".")
rownames(dftNew) <- unlist(temp)
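As an alternative to building the suffixes by hand, base R's `make.unique()` does something similar. A minimal sketch follows; the vector `ids` is made-up illustration data, not taken from the question. Note that `make.unique()` numbers each repeated name separately, whereas the code above uses one running counter across all duplicates.

```r
# Made-up identifiers containing repeats (illustration only).
ids <- c("P54792", "O14640", "P54792", "A1L020", "P54792")

# Later occurrences of a repeated name get ".1", ".2", ... appended.
unique_ids <- make.unique(ids)
print(unique_ids)
# "P54792" "O14640" "P54792.1" "A1L020" "P54792.2"
```

The result can then be assigned with `rownames(dftNew) <- unique_ids`, provided its length matches the number of rows.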
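Comment 2
To add an ID variable that maps the row number of each original observation in dft to the new observations in dftNew, you can reuse the earlier code: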
temp <- strsplit(gsub("(CON__|CON__H-INV:HIT000016045)", "", rownames(dft)),";")
dftNew$id <- rep(seq_along(temp), sapply(temp, length))
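The ID has to be computed from the list returned by `strsplit()`, before it is flattened with `unlist()`. In the console session shown in the question, `temp` had already been unlisted when the id was assigned, which is why the id column there simply runs from 1 to 10. A small sketch of the mapping, using three of the row names from the question (the names `parts`, `n_per_row` and `mapping` are only illustrative):

```r
# Three of the original row names from the question.
row_names <- c("A0AV96", "2A3N6;O14986;O14617", "P54792;O14640")

# Keep the list form: one element per original row.
parts <- strsplit(row_names, ";", fixed = TRUE)

# Number of identifiers found in each original row.
n_per_row <- lengths(parts)

# Repeat each original row number once per identifier.
id <- rep(seq_along(parts), n_per_row)

# Lookup table from each new row name back to its source row.
mapping <- data.frame(new_name = unlist(parts),
                      id = id,
                      original = row_names[id])
print(mapping)
```

Here `id` is 1, 2, 2, 2, 3, 3, so the three names split from the second row all point back to row 2.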