Getting an image URL so that only one image is shown

Time: 2017-07-26 18:58:37

Tags: python python-3.x scrape imageurl

I have this problem: I don't know how to show a single img. For example:

<img srcset="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s180/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 180w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s390/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 390w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s458/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 458w" src="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s615/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg">

As you can see above, there are several alternate images, but I am trying to scrape just one image to display.

import bs4 as bs
import urllib.request
import datetime
import random 
import re


random.seed(datetime.datetime.now())

sauce = urllib.request.urlopen('http://www.manchestereveningnews.co.uk/news/greater-manchester-news').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# 




title = soup.title
link = soup.link
# image = re.search(img 'srcset=img(.*?),)
# this doesn't work (it is not valid Python), not sure how to

strong = soup.strong
description = soup.description
location = soup.location


title = soup.find('h1', class_='publication-font')

image = soup.find('img')
strong = soup.find('strong')
location = soup.find('em').find('a')
description = soup.find('div', class_='description')


#Previous Code
print("H1:", title.text)
print("Article Link:", link)
print("Image Url:\n", image)
print("1st Paragraph:\n", strong.text)
print("2nd Paragraph:\n", description.string)
print("Location:\n", location.text)

My code is above, but the result from my previous attempt looked like this:

Greater Manchester News
<link href="rss.xml" rel="alternate" title="Default home feed" type="application/rss+xml"/>

<img data-src="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s615/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg" data-srcset="http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s180/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 180w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s390/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 390w, http://i4.manchestereveningnews.co.uk/incoming/article13390833.ece/ALTERNATES/s458/Mike-Grimshaw-34-was-fatally-attacked-following-the-attack-outside-his-Trafford-home-last-Thursday.jpg 458w"/>
        Family of dad stabbed in the neck while defending his fiancée from thugs speak of their heartbreak
        Mike Grimshaw, 34, died after being stabbed in the neck outside his home in Trafford last Thursday

Trafford

The result shows several image URLs, but I am trying to display only a single image link. How can I do that?

Any ideas would be much appreciated.

1 answer:

Answer 0: (score: 0)

You can access the attributes data-src or data-srcset to get the image you want:

image = soup.find('img')
single_img = image.get('data-src')  # returns the main image link
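
If some img tags on a page carry a plain src instead of data-src, a small fallback may help. This is only a minimal sketch: `first_attr` is a hypothetical helper (not part of BeautifulSoup), it assumes you pass it the tag's attribute dictionary (`image.attrs`), and the URLs are placeholders rather than the ones from the question.

```python
def first_attr(attrs, names=("data-src", "src")):
    """Return the first non-empty attribute value found, or None.

    `attrs` is any mapping of attribute names to values, for example
    the `attrs` dict of a BeautifulSoup tag. Hypothetical helper.
    """
    for name in names:
        value = attrs.get(name)
        if value:
            return value
    return None


# Stand-in for image.attrs so the sketch runs without network access.
# The URLs are placeholders.
example_attrs = {
    "data-src": "http://example.com/s615/photo.jpg",
    "data-srcset": "http://example.com/s180/photo.jpg 180w",
}
single_img = first_attr(example_attrs)
print(single_img)
```

With a real page you would call `first_attr(image.attrs)` after `image = soup.find('img')`, and check for None before using the result.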

import re
image = soup.find('img')
img_string = image.get('data-srcset')  # this returns a string that you have to parse
img_set = re.findall(r'(https?://[^\s]+)', img_string)  # regex that matches only the links

You can then access whichever index you want in img_set (just check the length of the list first).
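
Because each data-srcset entry is a URL followed by a width descriptor such as 180w, another option is to split the string and keep the widths, so that an image can be chosen by size rather than by position. The following is a rough sketch, assuming every candidate has the form "URL 180w" as in the output shown in the question; `parse_srcset` and `pick_widest` are hypothetical helper names, and the URLs are placeholders.

```python
def parse_srcset(srcset):
    """Split a srcset string into (url, width) pairs.

    Assumes each candidate looks like "URL 180w"; a candidate without
    a width descriptor gets a width of 0.
    """
    candidates = []
    for part in srcset.split(","):
        fields = part.split()
        if not fields:
            continue  # skip empty pieces, e.g. after a trailing comma
        url = fields[0]
        width = 0
        if len(fields) > 1 and fields[1].endswith("w") and fields[1][:-1].isdigit():
            width = int(fields[1][:-1])
        candidates.append((url, width))
    return candidates


def pick_widest(candidates):
    """Return the URL with the largest width, or None for an empty list."""
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


# Placeholder string standing in for image.get('data-srcset').
img_string = (
    "http://example.com/s180/photo.jpg 180w, "
    "http://example.com/s390/photo.jpg 390w, "
    "http://example.com/s458/photo.jpg 458w"
)
img_set = parse_srcset(img_string)
print(pick_widest(img_set))
```

Splitting on commas would misbehave for URLs that themselves contain commas; the URLs in the question do not, but the regex approach may be the safer choice if yours do.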