Extracting image titles and image URLs with BeautifulSoup

Posted: 2017-04-20 17:48:19

Tags: python html parsing beautifulsoup

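I am trying to use BeautifulSoup to extract the image URL and image title from an article. I can separate the block holding the image URL and title from the HTML before and after it, but I can't figure out how to separate those two values from their HTML tags. Here is my code: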

from bs4 import BeautifulSoup
import requests
# The URL must be a single string literal; it cannot be split across lines.
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
# Every <div class="image"> block in the article
links = soup.find_all('div', {'class': 'image'})

The two parts I want to extract are the src= and title= attributes. Any ideas on how to do these two extractions would be greatly appreciated.

3 answers:

Answer 0 (score: 3)

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'image'})
# Index the <img> tag inside each div like a dictionary to read its attributes.
# print() with parentheses works on both Python 2 and Python 3.
print([i.find('img')['src'] for i in links])
print([i.find('img')['title'] for i in links])
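
The list comprehensions above raise an error if a div.image block has no img tag, or if the img has no title attribute. A minimal sketch of a more tolerant version follows. It assumes a small inline HTML sample (SAMPLE_HTML, hypothetical, standing in for the downloaded page), a helper name extract_images that is made up for illustration, and the built-in 'html.parser' backend so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/a.jpg" title="First image"></div>
<div class="image"><img src="http://example.com/b.jpg"></div>
<div class="image"><p>No image here</p></div>
"""

def extract_images(html):
    """Return a list of (src, title) pairs, one per <img> found in a div.image block."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", {"class": "image"}):
        img = div.find("img")
        if img is None:
            # This block has no <img> at all, so skip it.
            continue
        # .get() returns None instead of raising KeyError for a missing attribute.
        results.append((img.get("src"), img.get("title")))
    return results

pairs = extract_images(SAMPLE_HTML)
for src, title in pairs:
    print(src, title)
```

To use it on the real article, pass r.text from the request instead of SAMPLE_HTML.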

Answer 1 (score: 0)

Try the following to extract all the image tags:

# find_all returns a list of tags, so loop over it; calling .get() on the list itself fails
for img in soup.find_all('img'):
    src = img.get('src')      # None if the attribute is missing
    title = img.get('title')
    print(src, title)
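
Scanning every img tag on the page may also pick up logos and icons. As an alternative sketch, a CSS selector can restrict the search to images inside div.image that carry both attributes. The inline SAMPLE_HTML is hypothetical sample data, and 'html.parser' is used here only to avoid extra dependencies:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one matching image, one logo outside div.image,
# and one image without a title attribute.
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/a.jpg" title="First image"></div>
<div class="other"><img src="http://example.com/logo.png" title="Logo"></div>
<div class="image"><img src="http://example.com/b.jpg"></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select <img> tags that sit inside <div class="image"> and have both src and title.
matches = soup.select("div.image img[src][title]")

# Both attributes are guaranteed by the selector, so direct indexing is safe here.
records = [{"src": img["src"], "title": img["title"]} for img in matches]
print(records)
```

Only the first image is kept: the logo is outside div.image and the third image has no title.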

Answer 2 (score: 0)

A late answer, but you can use:

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('div', {'class': 'image'})
if links:
    print(links[0].find('img')['src'])
    print(links[0].find('img')['title'])

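Output:

http://mma.prnewswire.com/media/491859/Koert_van_Mensvoort.jpg?w=950

Dutch philosopher Koert van Mensvoort -- founder of the Next Nature Network and fellow at the University of Technology in Eindhoven -- wrote a 'Letter to Humanity' in support of international Earth Day. (PRNewsfoto/Next Nature Network)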