Extracting alt text from images returns only the first image on the page

Time: 2019-03-21 19:21:30

Tags: python-3.x beautifulsoup

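I need to extract the alt text from the images on a page (only those in the article body). The code below does not capture all of them, only the first one on the page.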

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bbc.co.uk/news/uk-politics-47648565')
soup = BeautifulSoup(r.content, "html.parser")

# Collect the alt text of every image inside the article-body image containers
alt_tags = []
bio_img_soup = soup.find_all('span', {'class': 'image-and-copyright-container'})
for div in bio_img_soup:
    for img in div.find_all('img', alt=True):
        alt_tags.append(img['alt'])
print(alt_tags)

Can someone point me towards a solution? Thanks!

Update:

When I use Selenium, as shown below, it sometimes works, but at other times it still captures only the first image.

Here is the code:

import bs4
from selenium import webdriver

url = 'https://www.bbc.co.uk/news/uk-politics-47648565'

driver = webdriver.Chrome('/Users/vissea01/Downloads/chromedriver')
driver.get(url)

html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")

bios = []
bio_img_soup = soup.find_all('span', {'class': 'image-and-copyright-container'})
for div in bio_img_soup:
    for img in div.find_all('img', alt=True):
        bios.append(img['alt'])
# Drop the decorative separator images
bios = [i for i in bios if i not in ('Presentational grey line', 'Presentational white space')]
print(bios)

driver.close()

The same code outputs either:

['Theresa May arriving in Brussels']

or

['Theresa May arriving in Brussels', 'Analysis box by Katya Adler, Europe editor', 'Brexit timetable', 'Jeremy Corbyn']

1 Answer:

Answer 0 (score: 0)

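The page is dynamic. When you fetch it with requests, only the first image is part of the HTML source; the other images are rendered afterwards. You can render the page with Selenium first and then pull out all the img tags. You can grab those tags with Selenium itself or, if like me you are simply more comfortable with bs4, use that instead.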

from selenium import webdriver
import bs4

url = 'https://www.bbc.co.uk/news/uk-politics-47648565'

# Let a real browser load and render the page
driver = webdriver.Chrome()
driver.get(url)

# Parse the rendered HTML rather than the raw response
html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")

# Every <img> on the page that has an alt attribute
imgs = soup.find_all('img', alt=True)

for img in imgs:
    print(img['alt'])

driver.close()

Output:

Theresa May arriving in Brussels
Presentational grey line
Presentational grey line
Presentational grey line
Analysis box by Katya Adler, Europe editor
Presentational grey line
Brexit timetable
Presentational white space
Jeremy Corbyn
Theresa May arriving in Brussels
Anti-Brexit protests
Police at Parliament
‘It’s actually really good to get rejected’
How Brexit changed the English language
A forgotten food of the American South
Why water is one of the weirdest things in the Universe
What happens when we run out of food?
Canada's lake of methane
Imprints on the Sands of Time
Air India suspends Birmingham flights
Hen party mum to be buried in wedding dress
Is Kosovo’s capital city the ugliest in Europe?
Can a film be banned in the US?
Christine Chubbuck: The broadcaster who shot herself on air
[Gallery] The Worst Food From Every Single State
3 Ways Your Dog Asks For Help
[Gallery] This Is The Reason Clint Eastwood Never Discussed His Military Service
Seniors With No Life Insurance Feel Silly For Not Knowing This
No It's Not Oregano -- But This Plant Could Help You Retire Filthy Rich
This Holistic Remedy Improves Nail Fungus
Guns
Lauren and Dan Perkins with their six children
cyclone
Girl
Computer graphics
Guatemala village
Paris and Nanchanok
Kenyan boys and fishermen on Lake Victoria
Jacinda Ardern hugs woman
football being kicked on a field - Vauxhall image blurred in the background.
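
The output above covers every image on the page, while the question asks only for the body images and reports that the Selenium version sometimes returns just one of them. One plausible cause, not confirmed in this thread, is that page_source is read before the remaining images have been rendered. The sketch below polls until the number of img elements stops changing and then applies the question's container class and filter. The names wait_until_stable, scrape_body_alts and PRESENTATIONAL are hypothetical helpers introduced here, and the timeout, interval and settle values are arbitrary guesses.

```python
import time

# Decorative separator images that the question filters out.
PRESENTATIONAL = {"Presentational grey line", "Presentational white space"}


def wait_until_stable(read_value, timeout=15.0, interval=0.5, settle=3,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll read_value() until it returns the same value `settle` times
    in a row, or until `timeout` seconds pass. Returns the last value read."""
    deadline = clock() + timeout
    last = read_value()
    repeats = 1
    while repeats < settle and clock() < deadline:
        sleep(interval)
        current = read_value()
        # Reset the streak whenever the value changes.
        repeats = repeats + 1 if current == last else 1
        last = current
    return last


def scrape_body_alts(url):
    """Hypothetical end-to-end helper: render, wait, then parse with bs4."""
    # Third-party imports are local so the polling helper above
    # can be used (and tested) without Selenium or bs4 installed.
    import bs4
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)

        def count_imgs():
            # Number of <img alt=...> elements currently in the rendered DOM.
            return driver.execute_script(
                "return document.querySelectorAll('img[alt]').length")

        wait_until_stable(count_imgs)
        html = driver.page_source
    finally:
        driver.quit()  # always end the browser session

    soup = bs4.BeautifulSoup(html, "html.parser")
    alts = []
    # Same container class the question uses for body images.
    for span in soup.find_all("span", {"class": "image-and-copyright-container"}):
        for img in span.find_all("img", alt=True):
            if img["alt"] not in PRESENTATIONAL:
                alts.append(img["alt"])
    return alts
```

Calling scrape_body_alts('https://www.bbc.co.uk/news/uk-politics-47648565') would then return only the body captions, provided the timing assumption holds. Selenium also ships its own explicit-wait mechanism, WebDriverWait, which may be a better fit when there is a specific element to wait for. If the images only load as the page is scrolled, waiting alone will not be enough.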