How would I scrape these nested img tags?

Asked: 2020-05-21 07:27:18

Tags: python-3.x web-scraping beautifulsoup lxml

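I was scraping the headlines from this site and also tried to scrape the image that goes with each headline. The scrape returned the following markup: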

<div itemscope itemtype="https://schema.org/ItemList" class="group card-8-group-1 clearfix">
              <meta itemprop="itemListOrder" content="https://schema.org/ItemListOrderDescending" />
              <article  itemprop="itemListElement" itemscope itemtype="https://schema.org/Article" class="card card-1 news-card-1 card-type-article  type-article" data-sponsorship-type="card" data-sponsorship-article-id="1qo8sz0z1kaqb1dpj038v8658h" data-sponsorship-article-type="article" data-sponsorship-primary-tag="1pgecmpab62ei1akyb084izq3o" data-sponsorship-secondary-tag="22doj4sgsocqpxw45h607udje">
                 <a data-side="link" href="/en/news/spurs-investigation-aurier-appears-break-lockdown-protocols/1qo8sz0z1kaqb1dpj038v8658h" itemprop="url" data-sponsorship-slot="card" data-sponsorship-slot-id="front" class="type-article">
                    <div class="picture article-image" data-module="responsive-picture">
                       <img class="picture__image picture__image--lazyload" data-srcset="&amp;quality=60&amp;w=640 320w,&amp;quality=60&amp;w=560 480w,&amp;quality=60&amp;w=690 740w,&amp;quality=60&amp;w=800 980w,&amp;quality=60&amp;w=970 1580w" /> 
                       <noscript class="picture__polyfill"> <img src="https://images.daznservices.com/di/library/GOAL/5f/da/serge-aurier_191f5i34z69us1fausrs9k0mjk.jpg?t=1445827096&amp;quality=60&amp;h=170" alt="Serge Aurier" /> </noscript>
                    </div>
                    <div class="title">
                       <h3 title="Spurs launch investigation as Aurier appears to break lockdown protocols for a third time" itemprop="headline">Aurier appears to break lockdown protocols for a third time</h3>
                       <div class="image" data-sponsorship-slot="card" data-sponsorship-slot-id="image"></div>
                    </div>

It looks like the page uses lazy loading. My question is: how do I extract the img at its full size?

1 Answer:

Answer 0 (score: 1)

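To get the full-size image, replace w=55 with w=970 in the image URL. The w parameter sets the width, and 970 is the largest width listed in the data-srcset attribute shown in the question.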

For example:

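import requests
from bs4 import BeautifulSoup

url = 'https://www.goal.com/en/premier-league/2kwbbcootiqqgmrzs6o5inle5'

# Download the listing page and parse it with the built-in HTML parser
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Pair each article headline with the image found at the same position
for title, image in zip(soup.select('.card-type-article h3'),
                        soup.select('.card-type-article img')):
    title = title.get_text(strip=True)
    # Swap the thumbnail width for the largest width the site serves
    full_img_url = image['src'].replace('w=55', 'w=970')

    # Left-align the title in a 70-character column, then print the URL
    print('{:<70}{}'.format(title, full_img_url))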

Prints:

Wenger calls for FFP reform amid Newcastle takeover talk              https://images.daznservices.com/di/library/GOAL/63/cd/arsene-wenger-2019_13luew9ltpa2g1l1r6ziuxpwbw.jpg?t=1363081390&quality=60&w=970
'Special Havertz is half-Ozil, half-Ballack & would thrive in PL'     https://images.daznservices.com/di/library/GOAL/cc/18/kai-havertz_7sugon9o7ljy1fg2xzkv1mqcm.jpg?t=-1186202400&quality=60&w=970
Solskjaer: I'd rather a hole in my squad than an asshole              https://images.daznservices.com/di/library/GOAL/78/f2/ole-gunnar-solskjaer-manchester-united-2019-20_1vfk6liknrjlx1r8aumegh4cxe.jpg?t=-749345265&quality=60&w=970
Maguire praises Man Utd's 'safe' training return                      https://images.daznservices.com/di/library/GOAL/5d/e8/harry-maguire-man-utd_13ewrih27ahmb13i1zxfjrhrp8.jpg?t=-444094625&quality=60&w=970
Jorginho's agent opens door for Juve move                             https://images.daznservices.com/di/library/GOAL/69/da/jorginho-chelsea-2019-20_15zh5m3ojefx0zl1ei7qsyc14.jpg?t=-1675997073&quality=60&w=970
Premier League clubs near approval for contact training               https://images.daznservices.com/di/library/GOAL/79/ce/mohamed-salah-dejan-lovren-liverpool-training_7zq70upa8l1618svdzls077xn.jpg?t=143669454&quality=60&w=970
Ceballos reiterates desire to succeed at Real Madrid                  https://images.daznservices.com/di/library/GOAL/97/c6/dani-ceballos-arsenal_1sywf8w828w4b193xoz5c82uuf.jpg?t=-1552361252&quality=60&w=970
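
The plain string replacement only works when the URL contains exactly w=55. The noscript image in the question, for instance, carries h=170 instead. A slightly more defensive variant, sketched below, rewrites the query string with urllib.parse. The helper name with_width is hypothetical, and the sketch assumes the image server accepts w on its own once h is dropped, which matches the URLs printed above but is not confirmed beyond that.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_width(url, width=970):
    """Return url with its size hints replaced by a single w=<width>.

    Hypothetical helper: assumes the server sizes images from the
    'w' (width) and 'h' (height) query parameters only.
    """
    parts = urlsplit(url)
    # Keep every query parameter except the existing size hints
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key not in ('w', 'h')]
    # Append the requested width as the only size hint
    query.append(('w', str(width)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# The noscript image URL from the question, with &amp; already unescaped
thumb = ('https://images.daznservices.com/di/library/GOAL/5f/da/'
         'serge-aurier_191f5i34z69us1fausrs9k0mjk.jpg'
         '?t=1445827096&quality=60&h=170')
print(with_width(thumb))
```

In the loop above, with_width(image['src']) could stand in for the .replace(...) call. BeautifulSoup already unescapes &amp; in attribute values, so no extra decoding should be needed.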