Collecting all results from a page with BeautifulSoup

Date: 2018-03-03 20:26:15

Tags: python web-scraping beautifulsoup

**Update**

OK guys, so far so good. I have code that lets me scrape the images, but it stores them in a strange way: it downloads the first 40+ images, then creates another 'kittens' folder inside the previously created 'kittens' folder and starts over (downloading the same images as in the first folder). How can I change that? Here's the code:

```python
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests
import time
import os

image_tags = []

driver = webdriver.Chrome()
driver.get(url='https://www.pexels.com/search/kittens/')
last_height = driver.execute_script('return document.body.scrollHeight')

# Scroll to the bottom until the page height stops changing,
# so all lazy-loaded images end up in the DOM.
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

sp = soup(driver.page_source, 'html.parser')

for img_tag in sp.find_all('img'):
    image_tags.append(img_tag)

if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        with open('kitten-{}.jpg'.format(x), 'wb') as f:
            f.write(source.content)  # reuse the response instead of fetching the image twice
            x += 1
    except (KeyError, requests.RequestException):
        pass  # skip <img> tags without a src and failed downloads
```
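It's hard to reproduce the nested-folder behaviour from the code exactly as posted, but `os.chdir('kittens')` is the fragile part: once the working directory has changed, any later path that mentions `kittens` again resolves *inside* the folder you already moved into. A minimal sketch that sidesteps this by building absolute paths once and never changing directory (the `save_images` helper and its `dirname` parameter are my own names, not from the original code):

```python
import os
import requests

def save_images(image_tags, dirname='kittens'):
    # Resolve the target directory once, relative to where the script
    # was started, and never call os.chdir.
    target = os.path.join(os.getcwd(), dirname)
    os.makedirs(target, exist_ok=True)

    for x, image in enumerate(image_tags):
        src = image.get('src')
        if not src:
            continue  # skip <img> tags without a src attribute
        path = os.path.join(target, 'kitten-{}.jpg'.format(x))
        if os.path.exists(path):
            continue  # already saved on a previous run, don't re-download
        response = requests.get(src)
        if response.status_code == 200:
            with open(path, 'wb') as f:
                f.write(response.content)
```

Because every file is written to an absolute path, re-running the script (or re-entering the download loop) can never create a `kittens` folder inside `kittens`.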

---

I'm trying to write a spider that scrapes kitten images from a page. I've run into a small problem: my spider only gets the first 15 images. I know this is probably because the page loads more images as you scroll down. How can I solve this? Here's the code:

```python
import requests
from bs4 import BeautifulSoup as bs
import os


url = 'https://www.pexels.com/search/cute%20kittens/'

page = requests.get(url)
soup = bs(page.text, 'html.parser')

image_tags = soup.findAll('img')

if not os.path.exists('kittens'):
    os.makedirs('kittens')

os.chdir('kittens')

x = 0

for image in image_tags:
    try:
        url = image['src']
        source = requests.get(url)
        if source.status_code == 200:
            with open('kitten-' + str(x) + '.jpg', 'wb') as f:
                f.write(source.content)  # reuse the response; the with block closes the file
                x += 1
    except (KeyError, requests.RequestException):
        pass  # skip <img> tags without a src and failed downloads
```

1 Answer:

Answer 0 (score: 1)

Since the site is dynamic, you need a browser automation tool such as selenium:

```python
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time
import os

driver = webdriver.Chrome()
driver.get('https://www.pexels.com/search/cute%20kittens/')
last_height = driver.execute_script("return document.body.scrollHeight")

# Keep scrolling until the page height stops changing, i.e. until
# all lazy-loaded results have been fetched.
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

image_urls = [i['src'] for i in soup(driver.page_source, 'html.parser').find_all('img') if i.get('src')]

if not os.path.exists('kittens'):
    os.makedirs('kittens')
os.chdir('kittens')

with open('kittens.txt', 'w') as f:  # 'w' mode was missing in the original, which made open() fail
    for url in image_urls:
        f.write('{}\n'.format(url))
```
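Note that this version only writes the collected URLs to `kittens.txt` rather than saving the files. A minimal follow-up sketch (my own addition, not part of the original answer) that downloads each collected URL with requests, assuming `image_urls` was populated by the scrolling loop above and the working directory is already `kittens`:

```python
import requests

for x, url in enumerate(image_urls):
    try:
        response = requests.get(url)
    except requests.RequestException:
        continue  # skip URLs that fail to download
    if response.ok:
        with open('kitten-{}.jpg'.format(x), 'wb') as f:
            f.write(response.content)
```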