Print a line of HTML, keeping the correct formatting

Time: 2016-05-15 09:45:34

Tags: python beautifulsoup

I have to print all the raw text of this HTML page.

Each line has this format:

ENSG00000001461&nbsp;&nbsp;&nbsp;&nbsp;ENST00000432012&nbsp;&nbsp;&nbsp;&nbsp;NIPAL3&nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;Forward&nbsp;&nbsp;&nbsp;&nbsp;NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]<br/>

I want the following output:

ENSG00000001461    ENST00000432012    NIPAL3    5    1    Forward    NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]

But the output is only:

ENSG00000001461 

Here is my code:

import urllib
from bs4 import BeautifulSoup
species = ['HomoSapiens', 'MusMusculus', 'DrosophilaMelanogaster','CaenorhabditisElegans']
rna_target = ['mRNA', 'lincRNA', 'lncRNA']
db = ['MB21E78v2', 'MB19E65v2', 'MB16E62v1']

species_input = input("Selezionare Specie: ")
target_input = input("Selezionare tipo di RNA: ")
db_input = input("Selezionare DataBase: ")
check = 0

for i in range(len(species)):
    if species_input == species[i]:
        for j in range(len(rna_target)):
            if target_input == rna_target[j]:
                for k in range(len(db)):
                    if db_input == db[k]:
                        check = 1
if check == 1:
    print("Dati Inseriti Correttamente!")
else:
    print("Error: Dati inseriti in modo errato!")
    exit()

url =   urllib.request.urlopen("<https://cm.jefferson.edu/rna22/Precomputed/OptionController?>" +"species=" + species_input + "&type=" + target_input + "&version=" +db_input)
print(url.geturl())

identifier = []
seq_input = input("Digitare ID miRNA: ")
seq = ""
seq = seq_input.split()
print(seq)

for i in range(len(seq)):
    identifier.append(seq[i] + "%20")
s = ""
string = s.join(identifier)

url_tab = urllib.request.urlopen("<https://cm.jefferson.edu/rna22/Precomputed/InputController?>"+"identifier=" string+"&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&"+"version=" + db_input + "&species=" + species_input + "&type=" + target_input)
print(url_tab.geturl())

download = urllib.request.urlopen("
<http://cm.jefferson.edu/rna22/Precomputed/InputController?>download=ALL"+"&ident=" + string+"&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&" +"version=" + db_input + "&species=" + species_input + "&type=" + target_input)
down_string = download.geturl()
print(down_string)
soup = BeautifulSoup(download, "html5lib")
for match in soup.findAll('br'):
    match.unwrap()
s2 = soup
s1 = s2.body.extract()
print(s1.prettify(formatter=lambda s: s.strip(u'xa0')))

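2 answers:

Answer 0 (score: 1)

There is no concept of lines in the source: it is one long run of text, broken up only by br tags.

If you have to parse the source, you can replace the br tags with newlines and then pull the text: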

import requests
from bs4 import BeautifulSoup

r = requests.get("https://cm.jefferson.edu/rna22/Precomputed/InputController?download=ALL&ident=hsa_miR_107%20hsa_miR_5011_5p%20hsa_miR_326&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&version=MB21E78v2&species=HomoSapiens&type=mRNA")

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.content, "html.parser")
# Swap every <br> tag for a newline so each record lands on its own line
for b in soup.find_all("br"):
    b.replace_with("\n")
print(soup.text)

Which will give you:

ENSG00000001461    ENST00000432012    NIPAL3    5    1    Forward    NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]
ENSG00000001631    ENST00000340022    KRIT1    5    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000394503    KRIT1    3    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000394505    KRIT1    3    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000394507    KRIT1    4    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631    ENST00000412043    KRIT1    4    7    Reverse    KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000002834    ENST00000318008    LASP1    6    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000002834    ENST00000433206    LASP1    6    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000002834    ENST00000435347    LASP1    5    17    Forward    LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000005381    ENST00000225275    MPO    5    17    Reverse    myeloperoxidase [Source:HGNC Symbol;Acc:HGNC:7218]
ENSG00000005889    ENST00000539115    ZFX    4    23 X    Forward    zinc finger protein, X-linked [Source:HGNC Symbol;Acc:HGNC:12869]
ENSG00000006432    ENST00000554752    MAP3K9    10    14    Reverse    mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861]
ENSG00000006432    ENST00000611979    MAP3K9    10    14    Reverse    mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861]
ENSG00000007216    ENST00000314669    SLC13A2    4    17    Forward    solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917]
ENSG00000007216    ENST00000444914    SLC13A2    4    17    Forward    solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917]

And a lot more of the same.
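The same idea can be tried offline on a small inline sample, without contacting the server. This is only a sketch under a few assumptions: it uses the standard library's HTMLParser instead of BeautifulSoup, the `sample` string is a shortened imitation of the line format shown in the question, and the names `BrToNewline` and `br_to_lines` are made up for illustration. Unlike the snippet above, it also turns each non-breaking space into an ordinary space.

```python
from html.parser import HTMLParser


class BrToNewline(HTMLParser):
    """Collect text content, emitting a newline for every <br> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both end up here
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # &nbsp; has already been decoded to U+00A0 at this point
        self.parts.append(data)


def br_to_lines(html):
    """Return the non-empty text lines of `html`, split at <br> tags."""
    parser = BrToNewline()
    parser.feed(html)
    parser.close()
    # Map non-breaking spaces to plain spaces before splitting
    text = "".join(parser.parts).replace("\xa0", " ")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical two-record sample, shortened from the format in the question
sample = (
    "<html><body>"
    "ENSG00000001461&nbsp;&nbsp;&nbsp;&nbsp;ENST00000432012&nbsp;&nbsp;&nbsp;&nbsp;NIPAL3<br/>"
    "ENSG00000001631&nbsp;&nbsp;&nbsp;&nbsp;ENST00000340022&nbsp;&nbsp;&nbsp;&nbsp;KRIT1<br/>"
    "</body></html>"
)

for line in br_to_lines(sample):
    print(line)
```

Returning a list of lines rather than one string makes it easy to split each record into its fields afterwards.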

Answer 1 (score: -1)

I tested your code and replaced my previous answer.

Your code seems to work if you fix the following errors.

  • Remove the < from the URLs
  • On line 42, remove the line break (EOL) inside the string literal
  • Add a + between "identifier=" and string
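All three errors come from assembling the URL by hand. One way to sidestep them is to let urllib.parse build the query string. The sketch below is an assumption-laden illustration, not a tested client: the function name `build_download_url` is hypothetical, the https base URL and parameter names are copied from the URLs in this thread, and whether the server accepts the result has not been checked here.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in this thread (without the stray angle brackets)
BASE = "https://cm.jefferson.edu/rna22/Precomputed/InputController"


def build_download_url(mirna_ids, version, species, rna_type):
    """Build the download URL for a list of miRNA identifiers."""
    params = [
        ("download", "ALL"),
        # Identifiers are space-separated; the space is encoded below
        ("ident", " ".join(mirna_ids)),
        ("minBasePairs", "12"),
        ("maxFoldingEnergy", "-12"),
        ("minSumHits", "1"),
        ("maxProb", ".1"),
        ("version", version),
        ("species", species),
        ("type", rna_type),
    ]
    # quote_via=quote encodes spaces as %20 instead of +
    return BASE + "?" + urlencode(params, quote_via=quote)


url = build_download_url(
    ["hsa_miR_107", "hsa_miR_5011_5p", "hsa_miR_326"],
    version="MB21E78v2",
    species="HomoSapiens",
    rna_type="mRNA",
)
print(url)
```

Joining the identifiers with a single space also avoids the trailing %20 that the loop in the question appends after the last identifier.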

Here are some lines of the output I got:

 ENSG00000272325    ENST00000607016    NUDT3    4    6    Reverse    nudix (nucleoside diphosphate linked moiety X)-type motif 3 [Source:HGNC Symbol;Acc:HGNC:8050]
 ENSG00000272980    ENST00000400926    CCR6    5    6    Forward    chemokine (C-C motif) receptor 6 [Source:HGNC Symbol;Acc:HGNC:1607]
 ENSG00000274211    ENST00000612932    SOCS7    8    17    Forward    suppressor of cytokine signaling 7 [Source:HGNC Symbol;Acc:HGNC:29846]
 ENSG00000274588    ENST00000611977    DGKK    4    23 X    Reverse    diacylglycerol kinase, kappa [Source:HGNC Symbol;Acc:HGNC:32395]
 ENSG00000275004    ENST00000613655    ZNF280B    4    22    Reverse    zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022]
 ENSG00000275004    ENST00000619852    ZNF280B    4    22    Reverse    zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022]
 ENSG00000275832    ENST00000622683    ARHGAP23    6    17    Forward    Rho GTPase activating protein 23 [Source:HGNC Symbol;Acc:HGNC:29293]
 ENSG00000277258    ENST00000616199    PCGF2    3    17    Reverse    polycomb group ring finger 2 [Source:HGNC Symbol;Acc:HGNC:12929]
 ENSG00000278871    ENST00000623344    KDM5D    8    24 Y    Reverse    lysine (K)-specific demethylase 5D [Source:HGNC Symbol;Acc:HGNC:11115]
 ENSG00000279096    ENST00000622918    AL356289.1    11    1    Forward    HCG1780467 {ECO:0000313|EMBL:EAX06861.1}; PRO0529 {ECO:0000313|EMBL:AAF16687.1}  [Source:UniProtKB/TrEMBL;Acc:Q9UI23]