Scrapy does not click through all pages

Asked: 2016-07-04 08:43:52

Tags: python selenium scrapy

I am using Scrapy to crawl an online shop. The products are loaded dynamically, which is why I use Selenium to page through the site. I start by scraping all the categories and then call the main function.

The problem appears when crawling the individual categories: the spider is supposed to scrape all the data from the first page, then click the button for the next page, and repeat until there is no button left. If I just put the URL of a single category into start_urls, the code works fine. Strangely, though, when I run it from the main code it does not click through all the pages: it switches to a new category at random before it has finished clicking every next button.

I have no idea why this happens.

import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from horni.items import HorniItem

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys

class horniSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.example.com']

    def parse(self, response):
        for post in response.xpath('//body'):
            item = HorniItem()
            for href in response.xpath('//li[@class="sub"]/a/@href'):
                item['maincategory'] = response.urljoin(href.extract())
                yield scrapy.Request(item['maincategory'], callback = self.parse_subcategories)

    def parse_subcategories(self, response):
        item = HorniItem()
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            item['subcategory'] = response.urljoin(href.extract())
            yield scrapy.Request(item['subcategory'], callback = self.parse_articles)


    def __init__(self):
        # One shared browser instance for the whole spider; closed when the spider closes
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse_articles(self, response):
        # Render the category page with Selenium, because the articles are loaded dynamically
        self.driver.get(response.url)
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        item = HorniItem()
        for sel in response.xpath('//body'):
            item['title'] = sel.xpath('//div[@id="article-list-headline"]/div/h1/text()').extract()
            yield item
        # Scrape the first page of the category
        for post in response.xpath('//body'):
            id = post.xpath('//a[@class="title-link"]/@href').extract()
            prices = post.xpath('//span[@class="price ng-binding"]/text()').extract()
            articles = post.xpath('//a[@class="title-link"]/span[normalize-space()]/text()').extract()
            id = [i.split('/')[-2] for i in id]
            prices = [x for x in prices if x != u'\xa0']
            articles = [w.replace(u'\n', '') for w in articles]
            result = zip(id, prices, articles)
            for id, price, article in result:
                item = HorniItem()
                item['id'] = id
                item['price'] = price
                item['name'] = article
                yield item
        # Keep clicking the NEXT button and scraping the rendered page until no button is left
        while True:
            try:
                next = self.driver.find_element_by_xpath('//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
                next.click()
                response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
                for post in response.xpath('//body'):
                    id = post.xpath('//a[@class="title-link"]/@href').extract()
                    prices = post.xpath('//span[@class="price ng-binding"]/text()').extract()
                    articles = post.xpath('//a[@class="title-link"]/span[normalize-space()]/text()').extract()
                    id = [i.split('/')[-2] for i in id]
                    prices = [x for x in prices if x != u'\xa0']
                    articles = [w.replace(u'\n', '') for w in articles]
                    result = zip(id, prices, articles)
                    for id, price, article in result:
                        item = HorniItem()
                        item['id'] = id
                        item['price'] = price
                        item['name'] = article
                        yield item
            except:
                # No NEXT button found (or the click failed): stop paging this category
                break

Update

So the problem seems to lie in the DOWNLOAD_DELAY setting. Because the next button on the site does not produce a new URL but only executes JavaScript, the URL of the page never changes.

1 Answer:

Answer 0 (score: 0)

I found the answer:

The problem is that, since the page content is generated dynamically, clicking the NEXT button does not actually change the URL. Combined with the project's DOWNLOAD_DELAY setting, this means the spider stays on each page for a fixed amount of time, whether or not it has managed to click every available NEXT button.

Setting DOWNLOAD_DELAY high enough lets the spider stay on each URL long enough to click through and scrape every page.
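
In a Scrapy project this is just a setting in settings.py. A minimal sketch; the value of 10 seconds is an assumption, not taken from the answer, and has to be tuned to how long Selenium needs to page through the longest category:

# settings.py -- sketch; the 10-second value is an assumption, tune it per site
# Keep the spider on each category URL long enough for Selenium to click
# through every NEXT button before Scrapy moves on to the next request.
DOWNLOAD_DELAY = 10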

The drawback is that this forces the spider to wait the full delay on every URL, even when there is no NEXT button left to click. But... oh well...
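
A less wasteful alternative (not what the answer above did, just a sketch) would be to drop the fixed delay and wait explicitly inside parse_articles until the next page has rendered, using the WebDriverWait and expected_conditions imports the spider already has. The XPath selectors are taken from the question's code; the helper name and the staleness check are assumptions:

from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_through_pages(driver, timeout=10):
    # Hypothetical helper: yields the rendered HTML of every page after the first.
    while True:
        try:
            next_btn = driver.find_element(
                By.XPATH, '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
        except NoSuchElementException:
            break  # no NEXT button left, we are on the last page
        # Remember an element from the current page so we can detect the re-render.
        first_link = driver.find_element(By.XPATH, '//a[@class="title-link"]')
        next_btn.click()
        try:
            # Wait until the old product links are gone instead of sleeping a fixed delay.
            WebDriverWait(driver, timeout).until(EC.staleness_of(first_link))
        except TimeoutException:
            break
        yield driver.page_source

Inside parse_articles each yielded page_source could then be wrapped in a TextResponse and parsed exactly as before, with DOWNLOAD_DELAY left at its default.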