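I am trying to compare large sequence data with and without SNPs and to label each SNP as synonymous or non-synonymous. I have a .fasta sequence and a .bim file from PLINK containing the conserved (reference) and alternative nucleotides: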
head(test)
pos ALT REF
1 2 G T
2 8 G T
3 65 C G
4 68 C G
5 77 T C
6 78 G C
I can substitute the alternative nucleotides for the reference ones:
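alt <- ref                              # copy first so the reference sequence is preserved
alt[test$pos] <- as.vector(test$ALT)    # write the ALT alleles at the SNP positions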
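I need to determine whether each substitution changes the amino acid or not. I would like to use the seqinr package, but maybe my approach is wrong?
So I have two character vectors holding the sequences (in the alt vector, the alternative nucleotides are marked in upper case):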
ref=c("a","t","g","t","c","g","t","c","g","g","c","c","g","c","g","g","g","c",
"c","a","a","g","a","c","a","a","c","g","g","a","g","a","t","a","c","c",
"g","c","t","g","g","g","g","a","c","t","a","c","a","t","c","a","a","g",
"t","g","g","a","t","g","t","g","c","g","g","c","g","c","c","g","g","t",
"g","g","c","c","g","t","g","c","g","g","g","c","g","g","c","g","c","c",
"a","t","g","g","c","c","a","a","c","c","t","c","c","a","g","c","g","c",
"g","g","c","g","t","t","g","g","c","t","c","c","c","t","c","g","t","c",
"c","g","t","g","a","c","a","t","t","g","g","c","g","a","c","c","c","c",
"t","g","c","c","t","c","a","a","c","c","c","a","t","c","c","c","c","c",
"g","t","t","a","a","g")
alt=c("a","G","g","t","c","g","t","G","g","g","c","c","g","c","g","g","g","c",
"c","a","a","g","a","c","a","a","c","g","g","a","g","a","t","a","c","c",
"g","c","t","g","g","g","g","a","c","t","a","c","a","t","c","a","a","g",
"t","g","g","a","t","g","t","g","c","g","C","c","g","C","c","g","g","t",
"g","g","c","c","T","G","g","c","g","g","C","c","g","g","c","g","c","c",
"a","t","g","g","c","c","a","a","c","c","t","c","c","a","g","c","g","c",
"g","g","c","g","t","t","g","g","c","t","C","c","c","t","c","g","C","c",
"c","T","t","g","a","c","a","T","t","g","g","c","g","a","c","c","c","c",
"t","g","c","c","t","c","a","a","c","c","c","a","t","c","c","c","C","c",
"g","t","t","a","a","g")
I can translate these vectors into amino acids:
t_ref=translate(ref)
t_alt=translate(alt)
Then I can compare them and see which positions changed:
which(ref != alt)       # nucleotide positions that differ
which(t_ref != t_alt)   # amino acid positions that differ
So the question is how to mark, in the test data frame, which nucleotides lead to an amino acid change. Thanks in advance.
Answer 0 (score: 2)
Use integer division (%/%) on the pos column of the nucleotide data to get the corresponding positions in the protein sequence:
library(seqinr)
(test$pos - 1) %/% 3 # zero-based codon index; add 1 for R's 1-based indexing
#[1] 0 2 21 22 25 25
t_ref[ 1+((test$pos-1) %/% 3)]
#[1] "M" "S" "G" "A" "R" "R" # look up the value in the reference protein sequence
t_alt[ 1+((test$pos-1) %/% 3)]
#[1] "R" "W" "A" "A" "L" "L" # test for equality against this value
# TRUE = amino acid unchanged, FALSE = amino acid changed
test$change <- t_ref[ 1+((test$pos-1) %/% 3)] == t_alt[ 1+((test$pos-1) %/% 3)]
test
#=====================
pos ALT REF change
1 2 G T FALSE
2 8 G T FALSE
3 65 C G FALSE
4 68 C G TRUE
5 77 T C FALSE
6 78 G C FALSE
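The comparison above translates the sequence with every SNP applied at once, so two SNPs in the same codon are judged together. If each SNP should be assessed on its own, one possible approach is to change only that base in its reference codon and translate the two codons. The following is a minimal sketch, not a definitive implementation: it assumes the coding frame starts at position 1 and uses the standard genetic code (the default of `seqinr::translate`). The toy sequence, the `snps` data frame and the helper `classify_snp` are hypothetical names introduced for illustration.

```r
library(seqinr)

# Toy data (hypothetical, not the sequence from the question)
ref  <- s2c("atgtcgtcggcc")   # codons atg tcg tcg gcc -> M S S A
snps <- data.frame(pos = c(2, 6, 8),
                   ALT = c("G", "A", "G"),
                   stringsAsFactors = FALSE)

# Classify one SNP by changing only that base in its reference codon.
# Assumes the coding frame starts at position 1 of ref_seq.
classify_snp <- function(pos, alt_allele, ref_seq) {
  codon_start <- 3 * ((pos - 1) %/% 3) + 1          # first base of the codon containing pos
  ref_codon <- ref_seq[codon_start:(codon_start + 2)]
  alt_codon <- ref_codon
  alt_codon[pos - codon_start + 1] <- tolower(alt_allele)
  if (translate(ref_codon) == translate(alt_codon)) "synonymous" else "non-synonymous"
}

snps$effect <- unname(mapply(classify_snp, snps$pos, snps$ALT,
                             MoreArgs = list(ref_seq = ref)))
snps
```

With real data, `ref` would be the full reference vector and `snps` the `test` data frame; a SNP in a trailing incomplete codon would need extra handling that this sketch does not include.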
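My first attempt got the "registration" of the integer division wrong (plain pos %/% 3 pushes every third nucleotide into the next codon). Subtracting 1 first gives the correct "registration", as this check shows: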
> (1:21 -1) %/% 3
[1] 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6
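The same arithmetic can also report where inside its codon each SNP sits, by pairing integer division (`%/%`) with the remainder operator (`%%`). This is a small base-R sketch using the six positions from the question; it assumes the reading frame starts at nucleotide 1, and the names `codon_index` and `codon_phase` are illustrative only.

```r
pos <- c(2, 8, 65, 68, 77, 78)        # 1-based SNP positions from the question

codon_index <- (pos - 1) %/% 3 + 1    # which codon (amino acid) each SNP falls in
codon_phase <- (pos - 1) %%  3 + 1    # position within that codon: 1, 2 or 3

data.frame(pos, codon_index, codon_phase)
```

Positions 77 and 78 both map to codon 26, so comparing the two translated sequences evaluates those two SNPs jointly rather than one at a time.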