I have a huge .csv file that looks like this:
Transcript Id Gene Id(name) Mirna Name miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
UTR3 21:30717114-30717142 0.05994568
UTR3 21:30717414-30717442 0.13591267
ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1
UTR3 6:105526681-105526709 0.133514751
I want to build a matrix like this from it:
Transcript Id Gene Id(name) Mirna Name miTG score UTR3 MRE_score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.13591267
I want to add three new columns to my new matrix, named UTR3, MRE_score and CDS.
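For each Gene ID (e.g. ENST00000286800) there can be multiple UTR3 rows in the original matrix (here two UTR3 rows for ENST00000286800 and one for ENST00000345080). We select the UTR3 row with the highest score in the third column. In the new matrix, the UTR3 value for each Gene ID will be the value from the second column of the original matrix.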
Can anyone help me reshape this data and build my new matrix?
Answer 0 (score: 3)
You could try parsing the file with regular expressions:
textfile <- "ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
UTR3 21:30717114-30717142 0.05994568
UTR3 21:30717414-30717442 0.13591267
ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1
UTR3 6:105526681-105526709 0.133514751"
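txt <- readLines(textConnection(textfile))
sepr <- grepl("^ENST.*", txt)  # TRUE for transcript rows, FALSE for UTR3 rows
r <- rle(sepr)
r <- r$lengths[!r$values]  # number of UTR3 rows under each transcript row (assumes at least one)
# capture transcript id, "gene id (name)", miRNA name and miTG score
regex <- "(\\S+)\\s+(\\S+\\s+\\([^)]+\\))\\s+(\\S+)\\s+(\\d+)"
m <- regexec(regex, txt[sepr])
m1 <- as.data.frame(t(sapply(regmatches(txt[sepr], m), "[", 2:5)))
m1 <- m1[rep(seq_len(nrow(m1)), r), ]  # repeat each transcript row once per UTR3 row
# capture the three whitespace-separated fields of each UTR3 row
regex <- "(\\S+)\\s+(\\S+)\\s+(\\S+)"
m <- regexec(regex, txt[!sepr])
m2 <- as.data.frame(t(sapply(regmatches(txt[!sepr], m), "[", 2:4)))
df <- cbind(m1, m2[,-1])  # drop the literal "UTR3" label column
names(df) <- c("Transcript Id", "Gene Id(name)", "Mirna Name", "miTG score", "UTR3", "MRE_score" )
rownames(df) <- NULL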
df
# Transcript Id Gene Id(name) Mirna Name miTG score UTR3 MRE_score
# 1 ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717114-30717142 0.05994568
# 2 ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.13591267
# 3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1 6:105526681-105526709 0.133514751
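The table above still has one row per UTR3 site, whereas the question asks for only the highest-scoring site per transcript. One possible follow-up is sketched below. It rebuilds a small stand-in for `df` with the same columns so that it runs on its own; the names `is_top` and `best` are introduced here for illustration and are not part of the answer above. It also assumes that grouping by `Transcript Id` alone is enough, which may not hold if the same transcript appears under several miRNAs.

```r
# Stand-in for the df built above (same column names, same sample values)
df <- data.frame(
  "Transcript Id" = c("ENST00000286800", "ENST00000286800", "ENST00000345080"),
  "Gene Id(name)" = c("ENSG00000156273 (BACH1)", "ENSG00000156273 (BACH1)",
                      "ENSG00000187772 (LIN28B)"),
  "Mirna Name"    = "hsa-let-7a-5p",
  "miTG score"    = "1",
  "UTR3"          = c("21:30717114-30717142", "21:30717414-30717442",
                      "6:105526681-105526709"),
  "MRE_score"     = c("0.05994568", "0.13591267", "0.133514751"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# The regex approach yields text (or factor) columns, so convert the score first
df$MRE_score <- as.numeric(as.character(df$MRE_score))

# Within each transcript, flag the first row that holds the maximum score
is_top <- ave(df$MRE_score, df[["Transcript Id"]],
              FUN = function(x) seq_along(x) == which.max(x)) == 1

best <- df[is_top, ]
rownames(best) <- NULL
best
```

Using `which.max` keeps exactly one row per transcript even when two sites tie on score, since it returns the first maximum.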
Answer 1 (score: 1)
Using this test data:
Lines <- " Transcript Id Gene Id(name) Mirna Name miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
UTR3 21:30717114-30717142 0.05994568
UTR3 21:30717414-30717442 0.13591267
ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1
UTR3 6:105526681-105526709 0.133514751"
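Read everything in and set up the names nms for the output. Then compute the grouping vector cs using a cumulative sum. The non-duplicated entries of cs are the first row of each group (the ENST rows) and the duplicated entries are the rows that follow (the UTR3 rows). Merge these two sets of rows by group and keep the row with the highest MRE_score in each group: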
DF <- read.table(text = Lines, header = TRUE, fill = TRUE, as.is = TRUE,
check.names = FALSE)
nms <- c("cs", names(DF)[1:5], "UTR3", "MRE_score") # out will have these names
DF$cs <- cumsum(!is.na(DF$Mirna)) # groups each ENST row with its UTR3 rows
dup <- duplicated(DF$cs) # FALSE for ENST rows and TRUE for UTR3 rows
both <- merge(DF[!dup, ], DF[dup, ], by = "cs")[c(1:6, 11:12)] # merge ENST & UTR3 rows
names(both) <- nms
both$MRE_score <- as.numeric(both$MRE_score)
Rank <- function(x) rank(x, ties.method = "first")
out <- both[ave(-both$MRE_score, both$cs, FUN = Rank) == 1, -1] # only keep largest score
giving:
> out
Transcript Id Gene Id(name) Mirna UTR3 MRE_score
2 ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.1359127
3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1 6:105526681-105526709 0.1335148
Note that the question refers to a CDS column, but it is not described and does not appear in the sample output, so we have ignored it.
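To see the grouping step in isolation, here is a toy sketch. The vector `is_header` is made up for illustration; it plays the role of `!is.na(DF$Mirna)` in the code above, marking which rows start a new transcript block.

```r
# TRUE marks a transcript (ENST) row, FALSE marks a UTR3 row beneath it
is_header <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# The running count of header rows gives every row the index of its block
cs <- cumsum(is_header)
cs
# [1] 1 1 1 2 2

# The first occurrence of each index is the header row; the rest are its UTR3 rows
dup <- duplicated(cs)
dup
# [1] FALSE  TRUE  TRUE FALSE  TRUE
```

Because the block index only depends on row order, this works without any key column shared between the ENST rows and the UTR3 rows.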