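Scraping image URLs

Time: 2020-11-02 14:01:32

Tags: python web-scraping

I am trying to get image URLs from knowyourmeme.com. However, I am not getting all the URLs from the 9 pages of images I want to scrape.

My code is as follows: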

from requests_html import HTMLSession
import time

session = HTMLSession()

last_page=10

url_image=[]

for page in range(1, last_page):
  time.sleep(1)
  r2= session.get('https://knowyourmeme.com/page/'+str(page))

  # Get the image URLs from the <source> elements

  meme_url_img = r2.html.xpath('/html/body/div/div/div/article/div/section/div/a/picture/source')
  #meme_url_img = r2.html.find(".newsfeed > article > div > section > .media > a > picture > source")

  # extend() keeps url_image a flat list; append() returns None and would nest one list per page
  url_image.extend(element.attrs['srcset'] for element in meme_url_img)

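As you can see, I have tried both a CSS selector and XPath, but neither returns all the results. At 5 images per page, there should be 45. What am I doing wrong?

Thanks.

1 answer:

Answer 0: (score: 1)

To get the correct response from the server, set the User-Agent HTTP header. Also, the simple CSS selector source[srcset] is sufficient: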

import requests
from bs4 import BeautifulSoup


url = 'https://knowyourmeme.com/page/{page}'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0'}

last_page=10

for page in range(1, last_page):
    u = url.format(page=page)
    print('Page no.{}..'.format(page))
    soup = BeautifulSoup(requests.get(u, headers=headers).content, 'html.parser')
    for img in soup.select('source[srcset]'):
        print(img['srcset'])
    print('-' * 80)

Prints:

Page no.1..
https://i.kym-cdn.com/news_feeds/icons/mobile/000/050/321/578.jpg
https://i.kym-cdn.com/photos/images/newsfeed/001/926/651/310.jpg
https://i.kym-cdn.com/news_feeds/icons/mobile/000/050/315/506.jpg
https://i.kym-cdn.com/editorials/icons/mobile/000/001/920/Screen_Shot_2020-10-30_at_3.28.31_PM.jpg
https://i.kym-cdn.com/photos/images/newsfeed/001/910/696/c6e.jpg
--------------------------------------------------------------------------------
Page no.2..
https://i.kym-cdn.com/news_feeds/icons/mobile/000/050/296/f26.jpg
https://i.kym-cdn.com/news_feeds/icons/mobile/000/050/320/cac.jpg
https://i.kym-cdn.com/photos/images/newsfeed/001/923/913/1c5.jpg
https://i.kym-cdn.com/news_feeds/icons/mobile/000/050/310/f1b.jpg
https://i.kym-cdn.com/editorials/icons/mobile/000/001/921/cover1.jpg
--------------------------------------------------------------------------------
Page no.3..
https://i.kym-cdn.com/editorials/icons/mobile/000/001/922/image_(55).jpg
https://i.kym-cdn.com/photos/images/newsfeed/001/927/929/e14.jpg
https://i.kym-cdn.com/news_feeds/icons/mobile/000/050/317/bb5.jpg
https://i.kym-cdn.com/photos/images/newsfeed/001/926/224/a84.jpg
https://i.kym-cdn.com/news/posts/original/000/000/766/Screen_Shot_2020-10-30_at_2.51.26_PM.png
--------------------------------------------------------------------------------

...and so on.
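
If the goal is to collect the URLs into one flat list rather than print them, the extraction step can be separated from the download step. The following is only a minimal sketch: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the names SrcsetCollector, extract_srcset_urls and sample_page are made up for illustration. The sample HTML is hand-written, not a real response from the site. In the loop above, the bytes returned by requests.get(u, headers=headers).content would be decoded and passed in instead.

```python
from html.parser import HTMLParser


class SrcsetCollector(HTMLParser):
    """Hypothetical helper: collects URLs from the srcset of every <source> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "source":
            return
        srcset = dict(attrs).get("srcset")
        if not srcset:
            return
        # A srcset may hold several candidates ("a.jpg 1x, b.jpg 2x").
        # Simplification: split on commas and keep the URL part only,
        # which assumes the URLs themselves contain no commas.
        for candidate in srcset.split(","):
            parts = candidate.split()
            if parts:
                self.urls.append(parts[0])


def extract_srcset_urls(html):
    """Return all srcset URLs found in an HTML string, in document order."""
    parser = SrcsetCollector()
    parser.feed(html)
    return parser.urls


# Hand-written sample standing in for one downloaded page (an assumption).
sample_page = """
<article><picture>
  <source srcset="https://i.kym-cdn.com/photos/images/newsfeed/001/926/651/310.jpg">
  <img src="fallback.jpg">
</picture></article>
<article><picture>
  <source srcset="https://i.kym-cdn.com/news_feeds/icons/mobile/000/050/315/506.jpg 1x">
</picture></article>
"""

all_urls = []
for page_html in [sample_page]:
    # extend() keeps the result flat: one entry per image, not one list per page
    all_urls.extend(extract_srcset_urls(page_html))

print(all_urls)
```

Keeping the parsing in its own function also makes it possible to check the selector against a saved copy of a page before sending any requests.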