我正在尝试使用此代码使用Python解析PRE标签
s = br.open(base_url+str(string))
u = br.geturl()
seq = br.open(u)
blat = BeautifulSoup(seq)
for res in blat.find('pre').findChildren():
seq = res.string
print seq
来自以下HTML源代码:
<PRE><TT>
<span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>aa<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span></TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=genomic></A>Genomic chr17 (reverse strand):</H4>
<PRE><TT>
tacatttttc tctaactgca aacataatgt tttcccttgt attttacaga 41256278
tgcaaacagc tataattttg caaaaaagga aaataactct cctgaacatc 41256228
<A NAME=1></A><span style='color:#22CCEE;'>T</span><span style='color:#3300FF;'>AAAAGATGA</span> <span style='color:#3300FF;'>AGTTTCTATC</span> <span style='color:#3300FF;'>ATCCAAA</span>gt<span style='color:#3300FF;'>A</span> <span style='color:#3300FF;'>TGGGCTACAG</span> <span style='color:#3300FF;'>AAAC</span><span style='color:#22CCEE;'>C</span>gtgcc 41256178
aaaagacttc tacagagtga acccgaaaat ccttccttgg taaaaccatt 41256128
tgttttcttc ttcttcttct tcttcttttc tttttttttt ctttt</TT></PRE>
<HR ALIGN="CENTER"><H4><A NAME=ali></A>Side by Side Alignment</H4>
<PRE><TT>
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
</TT></PRE>
当我想解析最后一个时,它给了我第一个PRE标签元素。我很欣赏任何实现它的建议。 我希望输出如下:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
而我当前的输出是
T
AAAAGATGA
AGTTTCTATC
ATCCAAA
A
TGGGCTACAG
AAAC
C
答案 0 :(得分:3)
您可以使用find_all()
获取最后的结果:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('../index.html'), 'html5lib')
pre = soup.find_all('pre')[-1]
print pre.text.strip()
其中index.html
包含您提供的html。
打印:
00000001 taaaagatgaagtttctatcatccaaaaaatgggctacagaaacc 00000045
<<<<<<<< ||||||||||||||||||||||||||| |||||||||||||||| <<<<<<<<
41256227 taaaagatgaagtttctatcatccaaagtatgggctacagaaacc 41256183
另一种选择是依靠之前的h4
代码获取相应的pre
:
h4 = soup.select('h4 > a[name="ali"]')[0].parent
print h4.find_next_sibling('pre').text.strip()