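I recently wrote a Python script that downloads sequences from a database. Given an accession number (for example Rv1617), it produces the output shown further down. Here is the script: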
import wget
import re
from HTMLParser import HTMLParser
e = raw_input("Enter the correct accession number.: ")
y = ''.join([i for i in e if i.isdigit()])
#print y
url = "http://tuberculist.epfl.ch/quicksearch.php?gene+name="+y+"&submit=Search#sequence"
#print url
filname = wget.download(url)
a = open(filname,'r')
b = a.readlines()
f = "|"+e+"|"
for c in b:
    if f in c:
        #x = c
        pattern = re.compile("> >.+<br /></")
        z = pattern.findall(c)
        #print z
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        final = ''.join(data)
        andfinal = final.replace(" ","")
        print andfinal,
# instantiate the parser and feed it some HTML
for xz in z:
    parser = MyHTMLParser()
    parser.feed(xz)
It downloads the sequence and prints it like this:
>>>
Enter the correct accession number.:Rv1617
>>M.tuberculosisH37Rv|Rv1617|pykA
VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGR
AVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGD
RVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFA
LNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMV
ARGDLGVELPLEEVPLVQKRAIQMARENAKPVIVATQMLDSMIENSRPTRAEASDVANAV
LDGADALMLSGETSVGKYPLAAVRTMSRIICAVEENSTAAPPLTHIPRTKRGVISYAARD
IGERLDAKALVAFTQSGDTVRRLARLHTPLPLLAFTAWPEVRSQLAMTWGTETFIVPKMQ
STDGMIRQVDKSLLELARYKRGDLVVIVAGAPPGTVGSTNLIHVHRIGEDDV
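The first line is fine, but the remaining lines contain newline characters or spaces that should be removed from the output. The output should look like this: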
>>M.tuberculosisH37Rv|Rv1617|pykA
VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGRAVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGDRVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFALNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMV
I tried:
andfinal = final.replace(" ","")
This works for spaces, but not for newline characters.
Please suggest what changes I should make.
Thanks and regards
Answer 0 (score: 1)
Split the string on newlines, then rejoin the lines, keeping the header line separate:
final_lines = final.splitlines()
final = final_lines[0] + '\n' + ''.join(final_lines[1:])
Demo:
>>> final = '''\
... >M. tuberculosis H37Rv|Rv1617|pykA
... VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGR
... AVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGD
... RVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFA
... LNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMV
... ARGDLGVELPLEEVPLVQKRAIQMARENAKPVIVATQMLDSMIENSRPTRAEASDVANAV
... LDGADALMLSGETSVGKYPLAAVRTMSRIICAVEENSTAAPPLTHIPRTKRGVISYAARD
... IGERLDAKALVAFTQSGDTVRRLARLHTPLPLLAFTAWPEVRSQLAMTWGTETFIVPKMQ
... STDGMIRQVDKSLLELARYKRGDLVVIVAGAPPGTVGSTNLIHVHRIGEDDV
... '''
>>> final_lines = final.splitlines()
>>> print final_lines[0] + '\n' + ''.join(final_lines[1:])
>M. tuberculosis H37Rv|Rv1617|pykA
VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGRAVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGDRVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFALNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMVARGDLGVELPLEEVPLVQKRAIQMARENAKPVIVATQMLDSMIENSRPTRAEASDVANAVLDGADALMLSGETSVGKYPLAAVRTMSRIICAVEENSTAAPPLTHIPRTKRGVISYAARDIGERLDAKALVAFTQSGDTVRRLARLHTPLPLLAFTAWPEVRSQLAMTWGTETFIVPKMQSTDGMIRQVDKSLLELARYKRGDLVVIVAGAPPGTVGSTNLIHVHRIGEDDV
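However, keep in mind that the FASTA format specifically allows line breaks within a sequence, and a decent FASTA library can parse the string for you.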
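As a rough illustration of what such a library does with wrapped lines, here is a minimal hand-rolled sketch. It is not a substitute for a real FASTA parser: the function name `parse_fasta` is hypothetical, it assumes every record starts with a line beginning with `>`, and it is written for Python 3 rather than the Python 2 used in the question.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Hypothetical helper for illustration only. Assumes each record
    starts with a '>' line and that sequence lines may be wrapped.
    """
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line)  # collect one wrapped sequence line
    if header is not None:
        yield header, "".join(chunks)


# Shortened example record (first two sequence lines only).
record = """>M. tuberculosis H37Rv|Rv1617|pykA
VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGR
AVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGD
"""

for header, sequence in parse_fasta(record):
    print(header)
    print(sequence)
```

Because the sequence is rebuilt from its lines, the wrapping in the downloaded text no longer matters, and the same function handles input that contains several records.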
Answer 1 (score: 0)
Or you can call replace twice:
w = s.replace('\n', '').replace(' ', '')
This also gives you single-line output.
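One thing to watch: applied to the whole record, the chained calls also remove the line break after the header. The sketch below applies the double replace to the sequence part only. It assumes the text always starts with a single header line, the helper name `flatten_sequence` is made up for illustration, and it is written for Python 3.

```python
def flatten_sequence(s):
    """Remove newlines and spaces from the sequence, keeping the header line.

    Hypothetical helper; assumes the first line of `s` is the header.
    """
    # Split off the header at the first newline; the rest is the sequence.
    header, _, body = s.partition("\n")
    return header + "\n" + body.replace("\n", "").replace(" ", "")


# Shortened example input with a stray space and wrapped lines.
s = ">M.tuberculosisH37Rv|Rv1617|pykA\nVTRRGKIVCTLGPATQRDD\n LVRALVEAGMDVARM\n"
print(flatten_sequence(s))
```

The header stays on its own line, and everything after it is collapsed into a single line.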
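Answer 2 (score: 0)
s.strip() is what you are looking for.
Called with no argument, it removes whitespace characters, including newlines, from both ends of the string.
Read one line at a time, strip each line, and join the results.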
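A minimal sketch of that line-by-line approach follows. It assumes the data is available as a file-like object (an in-memory `io.StringIO` stands in for the downloaded file here) and that the first line is the header. It is written for Python 3.

```python
import io

# Stand-in for the downloaded file; the content is a shortened example.
handle = io.StringIO(
    ">M. tuberculosis H37Rv|Rv1617|pykA\n"
    "VTRRGKIVCTLGPATQRDD \n"
    " LVRALVEAGMDVARM\n"
)

# The first line is the header; strip only its trailing newline.
header = handle.readline().strip()

# Strip each remaining line and join them into one sequence string.
sequence = "".join(line.strip() for line in handle)

print(header)
print(sequence)
```

Note that `strip()` only touches the ends of each line, so a space in the middle of a sequence line would still need `replace(" ", "")`.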