Grabbing the first link under the article tag in Python web scraping

Time: 2017-09-20 04:58:21

Tags: python html web-scraping

I am trying to grab only the first link under the article tag. This is what I have so far.


The loop below grabs both links under the article tag:

import re

# soup is the BeautifulSoup object built from the page's HTML
for link in soup.find("section", {"id": "grid"}).findAll("a", href=re.compile("/recipe/[0-9]*/.*/")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

            

<a href="/recipe/17066/janets-rich-banana-bread/" data-internal-referrer-link='hub recipe' data-click-id='cardslot 2' >

    <img class="grid-col__rec-image" data-lazy-load data-original-src="http://images.media-allrecipes.com/userphotos/250x250/00/17/17/171761.jpg" alt="Janet's Rich Banana Bread Recipe and Video - Sour cream guarantees a moist and tender loaf.  And bananas are sliced instead of mashed in this recipe, giving a concentrated banana taste in every bite." title="Janet's Rich Banana Bread Recipe and Video"  src="http://images.media-allrecipes.com/ar/spacer.gif" style="display: inline;" />

    <h3 class="grid-col__h3 grid-col__h3--recipe-grid">
        Janet's Rich Banana Bread
            <div class="grid-col__video">
                <a href="/video/1027/janets-rich-banana-bread/" data-internal-referrer-link='hub recipe' data-click-id='cardslot 2'><span class="icon--videoplay-small-white"></span></a>
            </div>
    </h3>
</a>
<a href="/recipe/17066/janets-rich-banana-bread/" data-internal-referrer-link='hub recipe' data-click-id='cardslot 2'>
    <div class="grid-col__ratings">
        <div class="rating-stars" data-scroll-to-anchor="reviews" data-ratingstars= 4.82000017166138 >
    <img height="16" width="16" src="http://images.media-allrecipes.com/ar-images/icons/rating-stars/full-star-2015.svg"  />
    <img height="16" width="16" src="http://images.media-allrecipes.com/ar-images/icons/rating-stars/full-star-2015.svg"  />
    <img height="16" width="16" src="http://images.media-allrecipes.com/ar-images/icons/rating-stars/full-star-2015.svg"  />
    <img height="16" width="16" src="http://images.media-allrecipes.com/ar-images/icons/rating-stars/full-star-2015.svg"  />
    <img height="16" width="16" src="http://images.media-allrecipes.com/ar-images/icons/rating-stars/full-star-2015.svg"  />

As you can see, there are two links there, and I am trying to get only the first one. Any help would be greatly appreciated!

1 answer:

Answer 0: (score: 0)

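The 'find' function always returns a single element (or None if nothing matches), while 'findAll' returns all matching elements (in this case, all the links):

first_link=soup.find("a")

Alternatively, you can pass the limit argument to findAll, which returns a list containing at most one element:

first_link=soup.findAll("a", limit=1)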
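
To apply this to the question's page, one option is to call find once per article instead of searching the whole grid. The sketch below is only an illustration under stated assumptions: it uses Beautiful Soup 4 (the bs4 package) with the built-in html.parser, it assumes each recipe card is wrapped in an article tag inside the section with id "grid", and the markup, the second recipe URL, and the helper name first_recipe_links are all made up for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup modeled on the snippet in the question.
# The second card and its URL are invented for illustration.
SAMPLE_HTML = """
<section id="grid">
  <article>
    <a href="/recipe/17066/janets-rich-banana-bread/"><h3>Janet's Rich Banana Bread</h3></a>
    <a href="/recipe/17066/janets-rich-banana-bread/"><div class="grid-col__ratings"></div></a>
  </article>
  <article>
    <a href="/recipe/99999/hypothetical-recipe/"><h3>Hypothetical Recipe</h3></a>
    <a href="/recipe/99999/hypothetical-recipe/"><div class="grid-col__ratings"></div></a>
  </article>
</section>
"""

RECIPE_HREF = re.compile(r"/recipe/[0-9]+/.*/")


def first_recipe_links(html):
    """Return the href of the first recipe link inside each <article>."""
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find("section", {"id": "grid"})
    links = []
    if grid is None:
        return links
    for article in grid.find_all("article"):
        # find() stops at the first match, so the duplicate link is skipped.
        link = article.find("a", href=RECIPE_HREF)
        if link is not None:
            links.append(link["href"])
    return links


print(first_recipe_links(SAMPLE_HTML))
```

Because find returns None when nothing matches, the helper checks for that before reading the href. If the real page does not wrap cards in article tags, the same idea should still work with whatever element encloses a single card.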

Reference: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Searching%20the%20Parse%20Tree