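I am new to R and working on my first bioinformatics assignment, so please excuse any silly mistakes. I am having trouble looping over a data frame. What I want to do is read a DNA sequence from a file and translate it into amino acids. The problem appears when I try to convert three nucleobases into an amino acid.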
My bases.csv file maps each codon (three nucleobases) to its amino acid. Its contents are:
a, b, c, amino
A, G, G, R
A, G, A, R
A, G, C, S
A, G, T, S
A, A, G, K
A, A, A, K
A, A, C, N
A, A, T, N
A, C, G, T
A, C, A, T
A, C, C, T
A, C, T, T
A, T, G, M
A, T, A, I
A, T, C, I
A, T, T, I
C, G, G, R
C, G, A, R
C, G, C, R
C, G, T, R
C, A, G, Q
C, A, A, Q
C, A, C, H
C, A, T, H
C, C, G, P
C, C, A, P
C, C, C, P
C, C, T, P
C, T, G, L
C, T, A, L
C, T, C, L
C, T, T, L
T, G, G, W
T, G, C, C
T, G, T, C
T, A, C, Y
T, A, T, Y
T, C, G, S
T, C, A, S
T, C, C, S
T, C, T, S
T, T, G, L
T, T, A, L
T, T, C, F
T, T, T, F
G, G, G, G
G, G, A, G
G, G, C, G
G, G, T, G
G, A, G, E
G, A, A, E
G, A, C, D
G, A, T, D
G, C, G, A
G, C, A, A
G, C, C, A
G, C, T, A
G, T, G, V
G, T, A, V
G, T, C, V
G, T, T, V
Edit: full source code:
lines <- c(readLines("demo.txt"))
bases <- read.csv(file="bases.csv", header=TRUE, sep=",")
gene_start <- FALSE
gene <- ""
amino <- ""
convert_to_amino <- function(a, b, c) {
  index <- 1
  while (index <= nrow(bases)) {
    if ((bases[index, 'a'] == a) && (bases[index, 'b'] == b) && (bases[index, 'c'] == c)) {
      return (bases[index, 'amino'])
    }
    index <- index + 1
  }
}
lines <- strsplit(lines, "")[[1]]
i <- 0
while ((i + 2 < length(lines))) {
  if (gene_start == FALSE) {
    if (lines[i] == 'A' && lines[i + 1] == 'T' && lines[i + 2] == 'G') {
      gene_start <- TRUE
      print ("gene found")
      amino <- convert_to_amino(lines[i], lines[i+1], lines[i+2])
      gene <- paste(gene, amino, sep="")
      i <- i + 3
    }
    i <- i + 1
  }
  else if (gene_start == TRUE) {
    if ((lines[i] == 'T' && lines[i + 1] == 'A' && lines[i + 2] == 'A') || (lines[i] == 'T' && lines[i + 1] == 'A' && lines[i + 2] == 'G') || (lines[i] == 'T' && lines[i + 1] == 'G' && lines[i + 2] == 'A')) {
      gene_start <- FALSE
      print (gene)
      gene <- ""
    }
    else {
      amino <- convert_to_amino(lines[i], lines[i+1], lines[i+2])
      gene <- paste(gene, amino, sep="")
    }
    i <- i + 3
  }
  else
    i <- i + 1
}
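What I am trying to do here is check whether the combination of bases a, b, c exists in the data frame. If it does, the corresponding amino acid code should be assigned.
But according to the output, the if condition is never satisfied. Any pointers on what is wrong here would be helpful.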
Answer 0 (score: 2)
I did find an approach that seems to work, although it is clumsy.
First, instead of keeping bases as a 4-column data frame, use a 2-column data frame.
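# Work on a copy of the 4-column table
ref <- bases
ref$amino <- as.character(ref$amino)
# Join the three base columns into a single codon string
ref$codon <- paste0(ref$a, ref$b, ref$c)
# Remove any spaces from the codon strings
ref$codon <- gsub(" ", "", ref$codon)
# Keep only the two columns needed for the lookup
ref <- ref[, c("amino", "codon")]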
It looks like this:
amino codon
1 R AGG
2 R AGA
3 S AGC
4 S AGT
5 K AAG
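The same two-column table can also be held as a named character vector, which turns the lookup into plain indexing by name. This is only a minimal sketch: the five rows are a hand-written stand-in for the first rows of ref, and the name lookup is illustrative rather than part of the original code.

```r
# Hypothetical stand-in for the first rows of `ref`
ref <- data.frame(
  amino = c("R", "R", "S", "S", "K"),
  codon = c("AGG", "AGA", "AGC", "AGT", "AAG"),
  stringsAsFactors = FALSE
)

# Named vector: names are codons, values are amino acid codes
lookup <- setNames(ref$amino, ref$codon)

# Index by name; codons that are absent from the table come back as NA
result <- unname(lookup[c("AGC", "AAG", "TTT")])
print(result)
```

Because missing codons yield NA instead of a zero-length result, the output always has one entry per input codon.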
Now take demo, which is just a short sample of your sequence:
CATGTTTCCACTTACAGATCCTTCAAAAAGAGTGTTTCAAAACTGCTCTATGA
demo="CATGTTTCCACTTACAGATCCTTCAAAAAGAGTGTTTCAAAACTGCTCTATGA"
demo<- strsplit(demo, "(?<=.{3})", perl = TRUE)[[1]]
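This splits the string into three-letter pieces, the codons (the last piece has only 2 letters because I picked the sample length arbitrarily):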
> demo
[1] "CAT" "GTT" "TCC" "ACT" "TAC" "AGA" "TCC" "TTC" "AAA" "AAG" "AGT" "GTT" "TCA" "AAA" "CTG" "CTC" "TAT" "GA"
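If the incomplete trailing piece is not wanted, the split can also be done with substring() over explicit start positions. The following is a sketch of that alternative, not the method used in this answer; the names n_complete, starts and codons are made up for illustration.

```r
demo <- "CATGTTTCCACTTACAGATCCTTCAAAAAGAGTGTTTCAAAACTGCTCTATGA"

# Number of complete codons; leftover bases at the end are ignored
n_complete <- nchar(demo) %/% 3

# Start position of each complete codon: 1, 4, 7, ...
starts <- seq(1, by = 3, length.out = n_complete)

# Extract three characters per start position
codons <- substring(demo, starts, starts + 2)
print(codons)
```

Here every element of codons has exactly three letters, so a later lookup never sees a partial codon.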
Then I look up each codon in the reference table ref:
sapply(demo,function(x)ref$amino[x==ref$codon])
which gives (only a sample shown):
$CAT
[1] " H"
$GTT
[1] " V"
$TCC
[1] " S"
$ACT
[1] " T"
This is a list, so the format may need further adjustment. The result agrees with your reference table.
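Two details of that output can be tidied up. The leading space in values such as " H" most likely comes from the spaces after the commas in bases.csv, and sapply returns a list because pieces with no match (stop codons, or the incomplete last piece) produce zero-length results. A possible sketch using match() and trimws(), with a small hand-written stand-in for ref and for the split sequence:

```r
# Hypothetical stand-in for a few rows of `ref`, including the leading
# spaces that appear in the output above
ref <- data.frame(
  amino = c(" H", " V", " S", " T"),
  codon = c("CAT", "GTT", "TCC", "ACT"),
  stringsAsFactors = FALSE
)
demo <- c("CAT", "GTT", "TCC", "ACT", "GA")  # last piece is incomplete

# match() returns NA for pieces that are not in the table
amino <- trimws(ref$amino[match(demo, ref$codon)])

# Mark unmatched pieces with "_" and collapse everything into one string
amino[is.na(amino)] <- "_"
protein <- paste(amino, collapse = "")
print(protein)
```

The "_" marker is an arbitrary choice for unmatched pieces; any other placeholder would work the same way.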
Answer 1 (score: 1)
We can convert the sequence into a 3-column data frame and then merge it with bases:
# dummy input
x <- "CATGTTTCCACTTACAGATCCTTCAAAAAGAGTGTTTCAAAACTGCTCTATGAAAAGGAATGTTCAACTCTGTGAGTTAAATAAAAGCAT"
nchar(x)
# [1] 90
# convert input to a dataframe with 3 columns matching bases columns
xdf <- data.frame(t(matrix(unlist(strsplit(x, "")), nrow = 3)))
colnames(xdf) <- colnames(bases)[1:3]
xdf$ix <- seq(nrow(xdf))
head(xdf)
# a b c ix
# 1 C A T 1
# 2 G T T 2
# 3 T C C 3
# 4 A C T 4
# 5 T A C 5
# 6 A G A 6
# merge to get amino column
xdfAmino <- merge(xdf, bases, all.x = TRUE)
# mark non-matches with "_"
xdfAmino$amino[is.na(xdfAmino$amino)] <- "_"
# result
paste(xdfAmino$amino[order(xdfAmino$ix)], collapse = "")
#[1] "HVSTYRSFKKSVSKLLYEKECSTL_VK_KH"
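The merge only finds matches when the values in bases are exactly "A", "G" and so on. If bases.csv has a space after each comma, as in the listing in the question, read.csv keeps that space by default, so " G" never equals "G". That would also explain why the if condition in the question is never satisfied. A minimal sketch, assuming such a file (two rows inlined through the text argument instead of being read from disk):

```r
csv_text <- "a, b, c, amino
A, T, G, M
C, A, T, H"

# Default: leading spaces are kept in unquoted character fields
raw <- read.csv(text = csv_text, header = TRUE, sep = ",",
                stringsAsFactors = FALSE)
print(raw$b == "T")    # compares " T" and " A" with "T"

# strip.white = TRUE removes the surrounding whitespace while reading
clean <- read.csv(text = csv_text, header = TRUE, sep = ",",
                  strip.white = TRUE, stringsAsFactors = FALSE)
print(clean$b == "T")
```

Reading bases.csv with strip.white = TRUE should therefore be enough for both the merge and an element-wise comparison to see clean single-letter values.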