How to remove unwanted characters, such as spaces and newlines, from a string

Time: 2013-12-17 11:42:01

Tags: python

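I recently wrote a Python script that downloads sequences from a database. If you provide an accession number (for example, Rv1617), it produces the output shown below.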

import wget
import re
from HTMLParser import HTMLParser
e = raw_input("Enter the correct accession number.: ")
y = ''.join([i for i in e if i.isdigit()])
#print y
url = "http://tuberculist.epfl.ch/quicksearch.php?gene+name="+y+"&submit=Search#sequence"
#print url
filname = wget.download(url)
a = open(filname,'r')
b = a.readlines()
f = "|"+e+"|"

for c in b:
    if f in c:
        #x = c
        pattern = re.compile("> >.+<br /></")
        z = pattern.findall(c)
        #print z

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        final = ''.join(data)
        andfinal = final.replace(" ","")
        print andfinal,
# instantiate the parser and feed it some HTML

for xz in z:
    parser = MyHTMLParser()
    parser.feed(xz)

It downloads a sequence like this:

>>>
Enter the correct accession number.:Rv1617

>>M.tuberculosisH37Rv|Rv1617|pykA
VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGR
AVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGD
RVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFA
LNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMV
ARGDLGVELPLEEVPLVQKRAIQMARENAKPVIVATQMLDSMIENSRPTRAEASDVANAV
LDGADALMLSGETSVGKYPLAAVRTMSRIICAVEENSTAAPPLTHIPRTKRGVISYAARD
IGERLDAKALVAFTQSGDTVRRLARLHTPLPLLAFTAWPEVRSQLAMTWGTETFIVPKMQ
STDGMIRQVDKSLLELARYKRGDLVVIVAGAPPGTVGSTNLIHVHRIGEDDV

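The first line is fine, but the remaining lines contain newline characters or spaces that should be removed from the output. The output should look like this: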

 >>M.tuberculosisH37Rv|Rv1617|pykA
VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGRAVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGDRVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFALNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMV

I tried:

andfinal = final.replace(" ","")

It works for spaces, but not for newlines.

Please suggest what changes I should make.

Thanks and regards

3 Answers:

Answer 0: (score: 1)

Split the string on newlines, then rejoin the lines:

final_lines = final.splitlines()
final = final_lines[0] + '\n' + ''.join(final_lines[1:])

Demo:

>>> final = '''\
... >M. tuberculosis H37Rv|Rv1617|pykA
... VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGR
... AVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGD
... RVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFA
... LNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMV
... ARGDLGVELPLEEVPLVQKRAIQMARENAKPVIVATQMLDSMIENSRPTRAEASDVANAV
... LDGADALMLSGETSVGKYPLAAVRTMSRIICAVEENSTAAPPLTHIPRTKRGVISYAARD
... IGERLDAKALVAFTQSGDTVRRLARLHTPLPLLAFTAWPEVRSQLAMTWGTETFIVPKMQ
... STDGMIRQVDKSLLELARYKRGDLVVIVAGAPPGTVGSTNLIHVHRIGEDDV
... '''
>>> final_lines = final.splitlines()
>>> print final_lines[0] + '\n' + ''.join(final_lines[1:])
>M. tuberculosis H37Rv|Rv1617|pykA
VTRRGKIVCTLGPATQRDDLVRALVEAGMDVARMNFSHGDYDDHKVAYERVRVASDATGRAVGVLADLQGPKIRLGRFASGATHWAEGETVRITVGACEGSHDRVSTTYKRLAQDAVAGDRVLVDDGKVALVVDAVEGDDVVCTVVEGGPVSDNKGISLPGMNVTAPALSEKDIEDLTFALNLGVDMVALSFVRSPADVELVHEVMDRIGRRVPVIAKLEKPEAIDNLEAIVLAFDAVMVARGDLGVELPLEEVPLVQKRAIQMARENAKPVIVATQMLDSMIENSRPTRAEASDVANAVLDGADALMLSGETSVGKYPLAAVRTMSRIICAVEENSTAAPPLTHIPRTKRGVISYAARDIGERLDAKALVAFTQSGDTVRRLARLHTPLPLLAFTAWPEVRSQLAMTWGTETFIVPKMQSTDGMIRQVDKSLLELARYKRGDLVVIVAGAPPGTVGSTNLIHVHRIGEDDV

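However, consider that the FASTA format explicitly allows line breaks, and a decent FASTA library can interpret the string for you.

Answer 1: (score: 0)

Or you can call replace twice: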

w = s.replace('\n', '').replace(' ', '')

This also gives you the output on a single line.
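Note that applying the chained replace to the whole record would also merge the header line into the sequence and strip the spaces from the header. A minimal sketch of one way around that, assuming the header is always the first line: `fasta_text` is a made-up, shortened record used only for illustration, and `print(...)` with a single argument behaves the same on Python 2 and Python 3.

```python
# Illustrative, shortened record; not real output from the database.
fasta_text = ">M. tuberculosis H37Rv|Rv1617|pykA\nVTRRGKIV CTLGP\nATQRDD\n"

# Split off the header at the first newline; everything after it is sequence.
header, _, body = fasta_text.partition("\n")

# Chained replace() calls drop newlines and spaces from the sequence body only.
sequence = body.replace("\n", "").replace(" ", "")

result = header + "\n" + sequence
print(result)
```

The header keeps its spaces because only `body` is passed through `replace`.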

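Answer 2: (score: 0)

s.strip() is what you are looking for.

If no argument is given, it removes all leading and trailing whitespace characters, including newlines. It does not touch whitespace in the middle of the string.

Read one line at a time, strip each line, and join the results.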
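The read, strip, and join approach could look roughly like the sketch below. It assumes the first non-empty line is the header and everything after it is sequence; `join_sequence_lines` and `sample` are hypothetical names introduced here, and `io.StringIO` stands in for the downloaded file.

```python
import io

def join_sequence_lines(handle):
    """Return the header line followed by the sequence on a single line."""
    header = None
    parts = []
    for line in handle:
        stripped = line.strip()  # drops leading/trailing whitespace, including '\n'
        if not stripped:
            continue  # skip blank lines
        if header is None:
            header = stripped  # first non-empty line is treated as the header
        else:
            # strip() leaves interior spaces alone, so remove those separately
            parts.append(stripped.replace(" ", ""))
    return (header or "") + "\n" + "".join(parts)

# Illustrative, shortened record; not real output from the database.
sample = io.StringIO(u">M. tuberculosis H37Rv|Rv1617|pykA\n  VTRRGKIV CTLGP \nATQRDD\n\n")
joined = join_sequence_lines(sample)
print(joined)
```

Because the function only iterates over `handle`, an open file object can be passed in place of `sample`.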