I need to print all of the raw text of this HTML page.
Every line has this format:
ENSG00000001461&nbsp;&nbsp;&nbsp;&nbsp;ENST00000432012&nbsp;&nbsp;&nbsp;&nbsp;NIPAL3&nbsp;&nbsp;&nbsp;&nbsp;5&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;Forward&nbsp;&nbsp;&nbsp;&nbsp;NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]<br/>
I want the following output:
ENSG00000001461 ENST00000432012 NIPAL3 5 1 Forward NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]
But the output is only:
ENSG00000001461
Here is my code:
import urllib.request  # urlopen lives in the urllib.request submodule
from bs4 import BeautifulSoup

# Accepted values for the three query parameters.
species = ['HomoSapiens', 'MusMusculus', 'DrosophilaMelanogaster', 'CaenorhabditisElegans']
rna_target = ['mRNA', 'lincRNA', 'lncRNA']
db = ['MB21E78v2', 'MB19E65v2', 'MB16E62v1']

# Read the user's choices (the prompts are in Italian).
species_input = input("Selezionare Specie: ")
target_input = input("Selezionare tipo di RNA: ")
db_input = input("Selezionare DataBase: ")

# Accept the input only if all three values are known.
check = 0
for i in range(len(species)):
    if species_input == species[i]:
        for j in range(len(rna_target)):
            if target_input == rna_target[j]:
                for k in range(len(db)):
                    if db_input == db[k]:
                        check = 1
if check == 1:
    print("Dati Inseriti Correttamente!")
else:
    print("Error: Dati inseriti in modo errato!")
    exit()

# Open the options page for the chosen species, RNA type and database version.
url = urllib.request.urlopen("https://cm.jefferson.edu/rna22/Precomputed/OptionController?" + "species=" + species_input + "&type=" + target_input + "&version=" + db_input)
print(url.geturl())

# Build the identifier string: each miRNA ID followed by a URL-encoded space.
identifier = []
seq_input = input("Digitare ID miRNA: ")
seq = seq_input.split()
print(seq)
for i in range(len(seq)):
    identifier.append(seq[i] + "%20")
s = ""
string = s.join(identifier)

# Open the results page for those identifiers.
url_tab = urllib.request.urlopen("https://cm.jefferson.edu/rna22/Precomputed/InputController?" + "identifier=" + string + "&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&" + "version=" + db_input + "&species=" + species_input + "&type=" + target_input)
print(url_tab.geturl())

# Request the full download for the same identifiers.
download = urllib.request.urlopen("http://cm.jefferson.edu/rna22/Precomputed/InputController?download=ALL" + "&ident=" + string + "&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&" + "version=" + db_input + "&species=" + species_input + "&type=" + target_input)
down_string = download.geturl()
print(down_string)

# Parse the download, unwrap the <br> tags and print the body text.
soup = BeautifulSoup(download, "html5lib")
for match in soup.findAll('br'):
    match.unwrap()
s2 = soup
s1 = s2.body.extract()
print(s1.prettify(formatter=lambda s: s.strip(u'\xa0')))
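As a side note, the validation and URL assembly above can be sketched more compactly with membership tests and `urllib.parse.urlencode`. This is only an illustration and does not address the parsing problem itself. The function name `build_download_url` and the constant names are hypothetical; the parameter names and values are copied from the URLs in this thread. `quote_via=quote` is used so that the space between identifiers is encoded as `%20`, as in the original code.

```python
from urllib.parse import urlencode, quote

# Allowed values, copied from the lists in the question.
SPECIES = {"HomoSapiens", "MusMusculus", "DrosophilaMelanogaster", "CaenorhabditisElegans"}
RNA_TYPES = {"mRNA", "lincRNA", "lncRNA"}
VERSIONS = {"MB21E78v2", "MB19E65v2", "MB16E62v1"}
BASE = "https://cm.jefferson.edu/rna22/Precomputed/InputController"


def build_download_url(species, rna_type, version, mirna_ids):
    """Validate the inputs and return the download URL (hypothetical helper)."""
    if species not in SPECIES or rna_type not in RNA_TYPES or version not in VERSIONS:
        raise ValueError("unknown species, RNA type or database version")
    params = {
        "download": "ALL",
        "ident": " ".join(mirna_ids),  # the space is encoded as %20 below
        "minBasePairs": 12,
        "maxFoldingEnergy": -12,
        "minSumHits": 1,
        "maxProb": ".1",
        "version": version,
        "species": species,
        "type": rna_type,
    }
    # quote_via=quote gives %20 for spaces instead of the default "+".
    return BASE + "?" + urlencode(params, quote_via=quote)


print(build_download_url("HomoSapiens", "mRNA", "MB21E78v2",
                         ["hsa_miR_107", "hsa_miR_326"]))
```

Unlike the loop in the question, this does not leave a trailing `%20` after the last identifier; whether the server cares about that difference is not something this thread confirms.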
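Answer 0 (score: 1)
The source has no concept of lines: it is one long run of text in which br tags mark where each record ends.
If you have to parse the source, you can replace the br tags with newlines and then pull the text: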
import requests
from bs4 import BeautifulSoup

r = requests.get("https://cm.jefferson.edu/rna22/Precomputed/InputController?download=ALL&ident=hsa_miR_107%20hsa_miR_5011_5p%20hsa_miR_326&minBasePairs=12&maxFoldingEnergy=-12&minSumHits=1&maxProb=.1&version=MB21E78v2&species=HomoSapiens&type=mRNA")
# Name the parser explicitly so bs4 does not have to pick one.
soup = BeautifulSoup(r.content, "html.parser")
# Turn every <br> into a real newline, then pull the text.
for b in soup.find_all("br"):
    b.replace_with("\n")
print(soup.text)
Which will give you:
ENSG00000001461 ENST00000432012 NIPAL3 5 1 Forward NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]
ENSG00000001631 ENST00000340022 KRIT1 5 7 Reverse KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631 ENST00000394503 KRIT1 3 7 Reverse KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631 ENST00000394505 KRIT1 3 7 Reverse KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631 ENST00000394507 KRIT1 4 7 Reverse KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000001631 ENST00000412043 KRIT1 4 7 Reverse KRIT1, ankyrin repeat containing [Source:HGNC Symbol;Acc:HGNC:1573]
ENSG00000002834 ENST00000318008 LASP1 6 17 Forward LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000002834 ENST00000433206 LASP1 6 17 Forward LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000002834 ENST00000435347 LASP1 5 17 Forward LIM and SH3 protein 1 [Source:HGNC Symbol;Acc:HGNC:6513]
ENSG00000005381 ENST00000225275 MPO 5 17 Reverse myeloperoxidase [Source:HGNC Symbol;Acc:HGNC:7218]
ENSG00000005889 ENST00000539115 ZFX 4 23 X Forward zinc finger protein, X-linked [Source:HGNC Symbol;Acc:HGNC:12869]
ENSG00000006432 ENST00000554752 MAP3K9 10 14 Reverse mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861]
ENSG00000006432 ENST00000611979 MAP3K9 10 14 Reverse mitogen-activated protein kinase kinase kinase 9 [Source:HGNC Symbol;Acc:HGNC:6861]
ENSG00000007216 ENST00000314669 SLC13A2 4 17 Forward solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917]
ENSG00000007216 ENST00000444914 SLC13A2 4 17 Forward solute carrier family 13 (sodium-dependent dicarboxylate transporter), member 2 [Source:HGNC Symbol;Acc:HGNC:10917]
And a lot more of the same.
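The replacement step can be tried offline on a small sample, which also shows one way to collapse the column separators into single spaces. This is a sketch under an assumption: that the columns are separated by runs of non-breaking spaces (the `xa0` in the question's `strip` call hints at this). The helper name `html_to_lines` and the two-row sample, truncated to the first three columns, are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample; real rows have more columns.
sample = (
    "<html><body>"
    "ENSG00000001461&nbsp;&nbsp;&nbsp;&nbsp;ENST00000432012&nbsp;&nbsp;&nbsp;&nbsp;NIPAL3<br/>"
    "ENSG00000001631&nbsp;&nbsp;&nbsp;&nbsp;ENST00000340022&nbsp;&nbsp;&nbsp;&nbsp;KRIT1<br/>"
    "</body></html>"
)


def html_to_lines(html):
    """Return one cleaned string per <br>-terminated record."""
    soup = BeautifulSoup(html, "html.parser")
    # Replace each <br> with a newline so the records become real lines.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    lines = []
    for raw in soup.get_text().splitlines():
        # str.split() with no argument also splits on U+00A0, so runs of
        # non-breaking spaces collapse to a single ordinary space.
        cleaned = " ".join(raw.split())
        if cleaned:
            lines.append(cleaned)
    return lines


for line in html_to_lines(sample):
    print(line)
```

Collapsing all whitespace this way is convenient for printing, but it makes the column boundaries indistinguishable from the spaces inside the description field.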
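Answer 1 (score: -1)
I tested your code and replaced my previous answer.
Your code seems to work once you fix the following errors.
Here are some lines of the output I got: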
ENSG00000272325 ENST00000607016 NUDT3 4 6 Reverse nudix (nucleoside diphosphate linked moiety X)-type motif 3 [Source:HGNC Symbol;Acc:HGNC:8050]
ENSG00000272980 ENST00000400926 CCR6 5 6 Forward chemokine (C-C motif) receptor 6 [Source:HGNC Symbol;Acc:HGNC:1607]
ENSG00000274211 ENST00000612932 SOCS7 8 17 Forward suppressor of cytokine signaling 7 [Source:HGNC Symbol;Acc:HGNC:29846]
ENSG00000274588 ENST00000611977 DGKK 4 23 X Reverse diacylglycerol kinase, kappa [Source:HGNC Symbol;Acc:HGNC:32395]
ENSG00000275004 ENST00000613655 ZNF280B 4 22 Reverse zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022]
ENSG00000275004 ENST00000619852 ZNF280B 4 22 Reverse zinc finger protein 280B [Source:HGNC Symbol;Acc:HGNC:23022]
ENSG00000275832 ENST00000622683 ARHGAP23 6 17 Forward Rho GTPase activating protein 23 [Source:HGNC Symbol;Acc:HGNC:29293]
ENSG00000277258 ENST00000616199 PCGF2 3 17 Reverse polycomb group ring finger 2 [Source:HGNC Symbol;Acc:HGNC:12929]
ENSG00000278871 ENST00000623344 KDM5D 8 24 Y Reverse lysine (K)-specific demethylase 5D [Source:HGNC Symbol;Acc:HGNC:11115]
ENSG00000279096 ENST00000622918 AL356289.1 11 1 Forward HCG1780467 {ECO:0000313|EMBL:EAX06861.1}; PRO0529 {ECO:0000313|EMBL:AAF16687.1} [Source:UniProtKB/TrEMBL;Acc:Q9UI23]
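If the rows are needed as separate columns rather than as printed lines, one option is to split each raw row on the separator before it is collapsed. The sketch below assumes the separator is a run of non-breaking spaces, which this thread suggests but does not confirm; `split_row` is a hypothetical helper, and the sample row is rebuilt from the first record quoted in the question.

```python
import re

NBSP = "\xa0"  # assumed column separator character


def split_row(raw_line):
    """Split one raw row into fields on runs of non-breaking spaces."""
    parts = re.split(NBSP + "+", raw_line.strip())
    # Drop empty pieces and trim ordinary spaces around each field.
    return [part.strip() for part in parts if part.strip()]


fields = [
    "ENSG00000001461",
    "ENST00000432012",
    "NIPAL3",
    "5",
    "1",
    "Forward",
    "NIPA-like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]",
]
raw = (NBSP * 4).join(fields)  # rebuild a raw row with the assumed separator

print(split_row(raw))
```

Splitting on the separator rather than on all whitespace keeps the free-text description, which contains ordinary spaces, in a single field.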