抓取要求您向下滚动的网站

时间:2017-08-10 17:49:38

标签: javascript python dynamic beautifulsoup bs4

我想在这里抓住这个网站:

但是,它需要向下滚动才能收集其他数据。我不知道如何使用美丽的汤或蟒蛇向下滚动。这里有人知道怎么做?

代码有点乱,但现在是。

import scrapy
from scrapy.selector import Selector
from testtest.items import TesttestItem
import datetime
from selenium import webdriver
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser
import re
import time

class MLStripper(HTMLParser):


class MySpider(scrapy.Spider):
        name = "A1Locker"

        def strip_tags(html):
            s = MLStripper()
            s.feed(html)
            return s.get_data()

     allowed_domains = ['https://www.a1lockerrental.com']
    start_urls = ['http://www.a1lockerrental.com/self-storage/mo/st-
 louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?
 category=all']

     def parse(self, response):

                 url='http://www.a1lockerrental.com/self-storage/mo/st-
louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?
category=Small'
                driver = webdriver.Firefox()
                driver.get(url)
                html = driver.page_source
                soup = BeautifulSoup(html, 'html.parser')
        url2='http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-
meramec-bottom-rd-facility/unit-sizes-prices#/units?category=Medium'
        driver2 = webdriver.Firefox()
                driver2.get(url2)
                html2 = driver.page_source
                soup2 = BeautifulSoup(html2, 'html.parser')                
                #soup.append(soup2)
                #print soup
        items = []
        inside = "Indoor"
                outside = "Outdoor"
        inside_units = ["5 x 5", "5 x 10"]
        outside_units = ["10 x 15","5 x 15", "8 x 10","10 x 10","10 x 
20","10 x 25","10 x 30"]
        sizeTagz = soup.findAll('span',{"class":"sss-unit-size"})
        sizeTagz2 = soup2.findAll('span',{"class":"sss-unit-size"})
        #print soup.findAll('span',{"class":"sss-unit-size"})



        rateTagz = soup.findAll('p',{"class":"unit-special-offer"})


        specialTagz = soup.findAll('span',{"class":"unit-special-offer"})
        typesTagz = soup.findAll('div',{"class":"unit-info"},)

        rateTagz2 = soup2.findAll('p',{"class":"unit-special-offer"})


        specialTagz2 = soup2.findAll('span',{"class":"unit-special-offer"})
        typesTagz2 = soup2.findAll('div',{"class":"unit-info"},)
        yield {'date': datetime.datetime.now().strftime("%m-%d-%y"),
                'name': "A1Locker"
                   }
        size = []
        for n in range(len(sizeTagz)):
                    print len(rateTagz)
                    print len(typesTagz)

                    if "Outside" in (typesTagz[n]).get_text():



                            size.append(re.findall(r'\d+',
 (sizeTagz[n]).get_text()))
                            size.append(re.findall(r'\d+',
 (sizeTagz2[n]).get_text()))
                            print "logic hit"
                for i in range(len(size)):
                        yield {
                    #soup.findAll('p',{"class":"icon-bg"})
                    #'name': soup.find('strong', {'class':'high'}).text

                    'size': size[i]
                    #"special": (specialTagz[n]).get_text(),
                    #"rate": re.findall(r'\d+',(rateTagz[n]).get_text()),
                    #"size": i.css(".sss-unit-size::text").extract(),
                    #"types": "Outside"

    }
            driver.close()

代码的所需输出是显示从此网页收集的数据:http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all

要这样做,需要能够向下滚动以查看其余数据。至少这就是我的想法。

谢谢, DM123

2 个答案:

答案 0 :(得分:2)

有一个提供此功能的webdriver功能。除了解析网站之外,BeautifulSoup不做任何事情。

检查出来:http://webdriver.io/api/utility/scroll.html

答案 1 :(得分:1)

您尝试抓取的网站是使用JavaScript动态加载内容。不幸的是,许多网络刮刀,如美丽的汤,不能自己执行JavaScript。然而,有许多选项,其中许多是无头浏览器的形式。一个经典的是PhantomJS,但值得一看这个great list of options on GitHub,其中一些可能与美丽的汤,如Selenium很好地搭配。

记住Selenium,this Stackoverflow question的答案也可能有所帮助。