Trying to extract a value from an HTML page using Beautiful Soup

Time: 2017-08-02 08:29:34

Tags: python html beautifulsoup tags

I am new to Python and Beautiful Soup, and I have a page like this:

<div class='pid-details'><p>
  <span>Drug:</span> <a href='/search.php?searchterm=amantadine&amp;referer=pillid'>Amantadine Hydrochloride</a><br />
  <span>Strength:</span> 100 mg<br/>
  <span>Pill Imprint:</span> <a href='/imprints/c-122-6021.html'>C-122</a><br /><span>Color:</span> Yellow<br /><span>Shape:</span> Capsule-shape</p>
  <a class='input-button small' href='/imprints/c-122-6021.html'>View Images &amp; Details</a>
  <a class='input-button input-button-outline-grey small' href='/imprints/c-122-6021.html?printable=1' rel='nofollow' target='_blank'><i class='icon icon-print'></i>Print</a>
</div>

My goal is to extract the text inside this tag:

<a href='/search.php?searchterm=amantadine&amp;referer=pillid'>Amantadine Hydrochloride</a>

So the result should be:

"Amantadine Hydrochloride"

Please point me in the right direction so I can get started with scraping. Thanks in advance.

1 answer:

Answer 0: (score: 0)

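I think this is what you want. This code builds a list, found, holding the inner text of each <a> tag that the regular expression matches. Note that the pattern only matches anchors whose first attribute is href.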

        import re
        from bs4 import BeautifulSoup

        # Sample HTML from the question; a triple-quoted string avoids escaping the quotes
        page = """<div class='pid-details'><p>
                  <span>Drug:</span> <a href='/search.php?searchterm=amantadine&amp;referer=pillid'>Amantadine Hydrochloride</a><br />
                  <span>Strength:</span> 100 mg<br/>
                  <span>Pill Imprint:</span> <a href='/imprints/c-122-6021.html'>C-122</a><br /><span>Color:</span> Yellow<br /><span>Shape:</span> Capsule-shape</p>
                  <a class='input-button small' href='/imprints/c-122-6021.html'>View Images &amp; Details</a>
                  <a class='input-button input-button-outline-grey small' href='/imprints/c-122-6021.html?printable=1' rel='nofollow' target='_blank'><i class='icon icon-print'>
                  </i>Print</a>
                </div>"""

        soup = BeautifulSoup(page, 'html.parser')

        found = []

        # Serialize each <a> tag and capture the text between its opening and closing tags.
        # Only anchors that start with "<a href" match, so the two button links are skipped.
        hrefs = soup.find_all('a')
        p = re.compile('<a href.*>(.*)</a>', re.IGNORECASE)
        for h in hrefs:
            m = p.search(str(h))
            if m:
                found.append(m.group(1))

        print(found)
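
As an alternative to running a regular expression over the serialized tags, the same value can probably be read with Beautiful Soup's own navigation methods. The sketch below is one possible approach, not the only one: it assumes the "Drug:" label always sits in a <span> that is followed by the link as a sibling, and that the installed bs4 version supports the string= argument (4.4 or later). The helper name drug_name is made up for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
page = """<div class='pid-details'><p>
  <span>Drug:</span> <a href='/search.php?searchterm=amantadine&amp;referer=pillid'>Amantadine Hydrochloride</a><br />
  <span>Strength:</span> 100 mg<br/>
  <span>Pill Imprint:</span> <a href='/imprints/c-122-6021.html'>C-122</a><br />
</p></div>"""

soup = BeautifulSoup(page, "html.parser")


def drug_name(soup):
    """Hypothetical helper: return the text of the link after the 'Drug:' label, or None."""
    # Find the <span> whose only text is exactly "Drug:".
    label = soup.find("span", string="Drug:")
    if label is None:
        return None
    # The drug link is the next <a> element at the same nesting level.
    link = label.find_next_sibling("a")
    return link.get_text(strip=True) if link else None


print(drug_name(soup))
```

Anchoring on the label text rather than on attribute order means the lookup does not depend on href being the first attribute of the tag, and get_text() returns the visible text without any regex over the markup.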