I'm new to Python and Beautiful Soup, and I have a page like this:
<div class='pid-details'><p>
<span>Drug:</span> <a href='/search.php?searchterm=amantadine&referer=pillid'>Amantadine Hydrochloride</a><br />
<span>Strength:</span> 100 mg<br/>
<span>Pill Imprint:</span> <a href='/imprints/c-122-6021.html'>C-122</a><br /><span>Color:</span> Yellow<br /><span>Shape:</span> Capsule-shape</p>
<a class='input-button small' href='/imprints/c-122-6021.html'>View Images & Details</a>
<a class='input-button input-button-outline-grey small' href='/imprints/c-122-6021.html?printable=1' rel='nofollow' target='_blank'><i class='icon icon-print'></i>Print</a>
</div>
My goal is to extract the text inside this tag:
<a href='/search.php?searchterm=amantadine&referer=pillid'>Amantadine Hydrochloride</a>
So the result should be
"Amantadine Hydrochloride"
Please point me in the right direction so I can get started with scraping. Thanks in advance.
Answer 0 (score: 0)
I think this is what you want. This code collects the text inside the matching <a> tags in a list named found:
import re
from bs4 import BeautifulSoup

# Sample markup from the question
page = '''<div class='pid-details'><p>
<span>Drug:</span> <a href='/search.php?searchterm=amantadine&referer=pillid'>Amantadine Hydrochloride</a><br />
<span>Strength:</span> 100 mg<br/>
<span>Pill Imprint:</span> <a href='/imprints/c-122-6021.html'>C-122</a><br /><span>Color:</span> Yellow<br /><span>Shape:</span> Capsule-shape</p>
<a class='input-button small' href='/imprints/c-122-6021.html'>View Images & Details</a>
<a class='input-button input-button-outline-grey small' href='/imprints/c-122-6021.html?printable=1' rel='nofollow' target='_blank'><i class='icon icon-print'></i>Print</a>
</div>'''

soup = BeautifulSoup(page, 'html.parser')
found = []
# Matches only anchors whose first attribute is href, and captures the link text
p = re.compile('<a href.*>(.*)</a>', re.IGNORECASE)
for h in soup.find_all('a'):
    m = p.search(str(h))
    if m:
        found.append(m.group(1))

# found holds the text of both plain links; the drug name is found[0]
print(found)
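
Since the regex above runs over the serialized tags and also picks up the imprint link, here is a rough alternative sketch that uses only Beautiful Soup's own navigation. It assumes each result block labels the drug name with <span>Drug:</span> followed by the link as a sibling inside the same <p>, as in the question's markup; drug_name is a hypothetical helper name, not part of any library.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, used only for illustration
html = """<div class='pid-details'><p>
<span>Drug:</span> <a href='/search.php?searchterm=amantadine&referer=pillid'>Amantadine Hydrochloride</a><br />
<span>Strength:</span> 100 mg<br/>
<span>Pill Imprint:</span> <a href='/imprints/c-122-6021.html'>C-122</a><br />
</p></div>"""


def drug_name(markup):
    """Return the text of the link that follows the 'Drug:' label, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # Find the <span> whose text is exactly "Drug:"
    label = soup.find("span", string="Drug:")
    if label is None:
        return None
    # The next <a> sibling of that span holds the drug name
    link = label.find_next_sibling("a")
    return link.get_text(strip=True) if link else None


print(drug_name(html))
```

If the label text or the nesting differs on the real page, the lookup returns None instead of raising, so you can check for that before using the value.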