我有一个像这样的excel文件..
Sr. No. GENE ID Gene Id (NCBI) Protein Id Protein Sequences
1 Lmo0001 984365 NP_463534.1
2 Lmo0002 984379 NP_463535.1
3 Lmo0003 984420 NP_463536.1
此列表扩展到3000个基因。我将序列保存在这样的文本板中,这些序列适用于所有3000个基因,每个序列之间有一个空格。
GI | 16802049 | REF | NP_463534.1 |染色体复制起始蛋白[单核细胞增生李斯特菌EGD-e] MQSIEDIWQETLQIVKKNMSKPSYDTWMKSTTAHSLEGNTFIISAPNNFVRDWLEKSYTQFIANILQEIT GRLFDVRFIDGEQEENFEYTVIKPNPALDEDGIEIGKHMLNPRYVFDTFVIGSGNRFAHAASLAVAEAPA KAYNPLFIYGGVGLGKTHLMHAVGHYVQQHKDNAKVMYLSSEKFTNEFISSIRDNKTEEFRTKYRNVDVL LIDDIQFLAGKEGTQEEFFHTFNTLYDEQKQIIISSDRPPKEIPTLEDRLRSRFEWGLITDITPPDLETR IAILRKKAKADGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLVNKDITAGLAAEALKDIIPSSKS QVITISGIQEAVGEYFHVRLEDFKAKKRTKSIAFPRQIAMYLSRELTDASLPKIGDEFGGRDHTTVIHAH EKISQLLKTDQVLKNDLAEIEKNLRKAQNMF
GI | 16802050 | REF | NP_463535.1 | DNA聚合酶III亚基β[单核细胞增生李斯特菌EGD-e] MKFVIERDRLVQAVNEVTRAISARTTIPILTGIKIVVNDEGVTLTGSDSDISIEAFIPLIENDEVIVEVE SFGGIVLQSKYFGDIVRRLPEENVEIEVTSNYQTNISSGQASFTLNGLDPMEYPKLPEVTDGKTIKIPIN VLKNIVRQTVFAVSAIEVRPVLTGVNWIIKENKLSAVATDSHRLALREIPLETDIDEEYNIVIPGKSLSE LNKLLDDASESIEMTLANNQILFKLKDLLFYSRLLEGSYPDTSRLIPTDTKSELVINSKAFLQAIDRASL LARENRNNVIKLMTLENGQVEVSSNSPEVGNVSENVFSQSFTGEEIKISFNGKYMMDALRAFEGDDIQIS FSGTMRPFVLRPKDAANPNEILQLITPVRTY
GI | 16802051 | REF | NP_463536.1 |假设蛋白质lmo0003 [单核细胞增生李斯特菌EGD-e] MMKDMTTGNPTKLIFLFAMPMLIGNLFQQFYTMIDAVIVGKFVSVDALAAVGATNSVNFFMISLIIGLMS GISVVVAQYFGFKDYDRLKDVIATATYAVVFSAIILTVAGVLLAKPLLILLRTPANILDDSTIFLTTLFI GILPMSLYNGMAAILRALGNSITPLIFLILSSLMNIALDFLFVVYMDMGVRGAAIATVLSQTAAAIAVIY YAYRHVPFMRIERAKFKLSTPLLKEMVRIGLPSGLQGSFISIGNMALQSLINGFGSSVVAAYTAASRIDS LTYQPGIAFGAASSMFAGQNIGAGKIDRVREGFWSGIKVVTAISIGITILVQLFARQFLLLFVDSSETEV INIGVSYLLIVSLFYVVVGILFVVRETLRGTGDAMVPLAMGIFELVSRLVIGFVLSLYIGYVGLWWATPV AWITATILGVWRYKSGAWQKKAVIRRK
GI | 16802052 | REF | NP_463537.1 |假设蛋白质lmo0004 [单核细胞增生李斯特菌EGD-e] MAETVKINSEFVTLGQLLQMIDVVSTGGMAKAYLSENTIYINGEQDNRRGKKLRNGDVILVPGVGKVKIE QGK
GI | 16802053 | REF | NP_463538.1 |重组蛋白F [单核细胞增生李斯特菌EGD-e] MHLESIVLRNFRNYENLELEFSPSVNVFLGENAQGKTNLLEAVLMLALAKSHRTTNDKDFIMWEKEEAKM EGRIAKHGQSVPLELAITQKGKRAKVNHLEQKKLSQYVGNLNVVIFAPEDLSLVKGAPGIRRRFLNMEIG QMQPIYLHNLSEYQRILQQRNQYLKMLQMKRKVDPILLDILTEQFADVAINLTKRRADFIQKLEAYAAPI HHQISRGLETLKIEYKASITLNGDDPEVWKADLLQKMESIKQREIDRGVTLIGPHRDDSLFYINGQNVQD FGSQGQQRTTALSIKLAEIDLIHEETGEYPVLLLDDVLSELDDYRQSHLLGAIEGKVQTFVTTTSTSGID HETLKQATTFYVEKGTVKKS
是否可以将每个序列点中的每个序列放在每一行上,而无需手动复制和粘贴每个序列?任何方法都没问题。
P.S我很抱歉这张荒谬的桌子,但没有足够的声望点,我无法发布图片,这是我能做到的最好的。@swapnil但是我想在第一张excel表中的蛋白质序列列下直接从记事本中复制序列。
答案 0 :(得分:0)
嗯,这里不会是简单的复制/粘贴。我认为你可以做的是将所有内容复制到一个新的Excel工作表中,并使用分隔符管|
对文本执行操作以获得最后一点:
chromosomal replication initiation protein [Listeria monocytogenes EGD-e] MQSIEDIWQETLQIVKKNMSKPSYDTWMKSTTAHSLEGNTFIISAPNNFVRDWLEKSYTQFIANILQEIT GRLFDVRFIDGEQEENFEYTVIKPNPALDEDGIEIGKHMLNPRYVFDTFVIGSGNRFAHAASLAVAEAPA KAYNPLFIYGGVGLGKTHLMHAVGHYVQQHKDNAKVMYLSSEKFTNEFISSIRDNKTEEFRTKYRNVDVL LIDDIQFLAGKEGTQEEFFHTFNTLYDEQKQIIISSDRPPKEIPTLEDRLRSRFEWGLITDITPPDLETR IAILRKKAKADGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLVNKDITAGLAAEALKDIIPSSKS QVITISGIQEAVGEYFHVRLEDFKAKKRTKSIAFPRQIAMYLSRELTDASLPKIGDEFGGRDHTTVIHAH EKISQLLKTDQVLKNDLAEIEKNLRKAQNMF
DNA polymerase III subunit beta [Listeria monocytogenes EGD-e] MKFVIERDRLVQAVNEVTRAISARTTIPILTGIKIVVNDEGVTLTGSDSDISIEAFIPLIENDEVIVEVE SFGGIVLQSKYFGDIVRRLPEENVEIEVTSNYQTNISSGQASFTLNGLDPMEYPKLPEVTDGKTIKIPIN VLKNIVRQTVFAVSAIEVRPVLTGVNWIIKENKLSAVATDSHRLALREIPLETDIDEEYNIVIPGKSLSE LNKLLDDASESIEMTLANNQILFKLKDLLFYSRLLEGSYPDTSRLIPTDTKSELVINSKAFLQAIDRASL LARENRNNVIKLMTLENGQVEVSSNSPEVGNVSENVFSQSFTGEEIKISFNGKYMMDALRAFEGDDIQIS FSGTMRPFVLRPKDAANPNEILQLITPVRTY
hypothetical protein lmo0003 [Listeria monocytogenes EGD-e] MMKDMTTGNPTKLIFLFAMPMLIGNLFQQFYTMIDAVIVGKFVSVDALAAVGATNSVNFFMISLIIGLMS GISVVVAQYFGFKDYDRLKDVIATATYAVVFSAIILTVAGVLLAKPLLILLRTPANILDDSTIFLTTLFI GILPMSLYNGMAAILRALGNSITPLIFLILSSLMNIALDFLFVVYMDMGVRGAAIATVLSQTAAAIAVIY YAYRHVPFMRIERAKFKLSTPLLKEMVRIGLPSGLQGSFISIGNMALQSLINGFGSSVVAAYTAASRIDS LTYQPGIAFGAASSMFAGQNIGAGKIDRVREGFWSGIKVVTAISIGITILVQLFARQFLLLFVDSSETEV INIGVSYLLIVSLFYVVVGILFVVRETLRGTGDAMVPLAMGIFELVSRLVIGFVLSLYIGYVGLWWATPV AWITATILGVWRYKSGAWQKKAVIRRK
hypothetical protein lmo0004 [Listeria monocytogenes EGD-e] MAETVKINSEFVTLGQLLQMIDVVSTGGMAKAYLSENTIYINGEQDNRRGKKLRNGDVILVPGVGKVKIE QGK
recombination protein F [Listeria monocytogenes EGD-e] MHLESIVLRNFRNYENLELEFSPSVNVFLGENAQGKTNLLEAVLMLALAKSHRTTNDKDFIMWEKEEAKM EGRIAKHGQSVPLELAITQKGKRAKVNHLEQKKLSQYVGNLNVVIFAPEDLSLVKGAPGIRRRFLNMEIG QMQPIYLHNLSEYQRILQQRNQYLKMLQMKRKVDPILLDILTEQFADVAINLTKRRADFIQKLEAYAAPI HHQISRGLETLKIEYKASITLNGDDPEVWKADLLQKMESIKQREIDRGVTLIGPHRDDSLFYINGQNVQD FGSQGQQRTTALSIKLAEIDLIHEETGEYPVLLLDDVLSELDDYRQSHLLGAIEGKVQTFVTTTSTSGID HETLKQATTFYVEKGTVKKS
这应该转到E栏。然后在F栏中,您可以输入公式:
=mid(E1, find("]", E1)+2, len(E1))
这将在结束方括号]
之后提取所有内容,从而返回所需的序列。
假设这些文件位于excel文件工作簿中的名称为Sheet2的工作表中(其中第一个工作表包含您现在拥有的表格)。
在第一张表中,输入公式:
=vlookup(D2, Sheet2!D:F, 3, 0)
这假设您的文本文件与表中列出的Protein Ids的顺序不同。否则,您只需将F列结果的值(复制和粘贴特殊,选择粘贴值)复制/粘贴到第一张表中,
答案 1 :(得分:0)
感谢您的回答。我实际上是使用正则表达式\ n ^ [a-z]在textpad上编辑它,然后将其复制到excel。所以这就解决了。谢谢了。我从另一个堆栈溢出问题中得到了这个建议。