Erratic Python / Scrapy scraping behavior: status 200 but data not always returned

Asked: 2018-03-10 13:46:47

Tags: python web-scraping scrapy

I've written a Python Scrapy spider to scrape 60,000 pages from a website, with each page yielding one item. When I scrape around 4,000 pages I have no problems: I get a 200 status from every page and extract all the data I need.

However, when I scale up to 8,000 pages, I can sometimes scrape everything with 200 status codes, but on other runs very few items come back. The output still shows all 8,000 URLs, each with a 200 status code, yet very few associated items are extracted.

I've noticed this happens when the crawler has successfully scraped all 8,000 pages and is then run again almost immediately afterwards. The same behavior appears when it attempts to scrape all 60,000 pages.

I have the following enabled in my settings.py file:

ROBOTSTXT_OBEY = True
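
Everything else is left at Scrapy's defaults. For reference, these are the throttling settings that interact with crawl rate; the values below are a placeholder sketch, not what I am currently running:

# Politeness/throttling knobs (illustrative values only -- tune per site).
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 0.5                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True           # back off automatically when latency rises
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0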

My code is as follows:

import scrapy

class UserItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()
    registered = scrapy.Field()

class TestSpider(scrapy.Spider):
    name = "website"
    allowed_domains = ['source_website']
    start_urls = ['http://source_website/page1',
                  'http://source_website/page2',
                  'http://source_website/page3',
                  'http://source_website/page4',
                  'http://source_website/page5']

    def parse(self, response):
        # Queue a request for every link found on the start pages.
        for link in response.xpath('//a/@href').extract():
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.parse_contents)

    def parse_contents(self, response):
        item = UserItem()
        item['url'] = response.url
        item['status'] = response.status

        # Walk the page's text nodes; the value of interest is the text
        # node immediately after the "Registered" label.
        data = iter(response.xpath('//html//text()').extract())
        for text in data:
            if "Registered" in text:
                # The default of None guards against StopIteration when
                # the label is the last text node on the page.
                item['registered'] = next(data, None)

        yield item
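
As an aside, the text-node walk in parse_contents relies on the "Registered" label and its value being adjacent text nodes. If the value instead lives in the element immediately after the label (assuming markup along the lines of <td>Registered</td><td>2018-01-01</td>), a more targeted XPath sketch would be:

# Hypothetical alternative extraction; assumes the value sits in the
# sibling element directly after the "Registered" label.
registered = response.xpath(
    '//*[contains(text(), "Registered")]/following-sibling::*[1]//text()'
).extract_first()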

Log of a successful scrape:

2018-03-10 17:49:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
 'downloader/request_bytes': 2489755,
 'downloader/request_count': 7524,
 'downloader/request_method_count/GET': 7524,
 'downloader/response_bytes': 19301544,
 'downloader/response_count': 7524,
 'downloader/response_status_count/200': 7520,
 'downloader/response_status_count/302': 4,
 'dupefilter/filtered': 45,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 10, 17, 49, 9, 119209),
 'item_scraped_count': 7517,
 'log_count/DEBUG': 15046,
 'log_count/INFO': 14,
 'memusage/max': 123379712,
 'memusage/startup': 49389568,
 'offsite/domains': 2,
 'offsite/filtered': 4,
 'request_depth_max': 1,
 'response_received_count': 7520,
 'scheduler/dequeued': 7524,
 'scheduler/dequeued/memory': 7524,
 'scheduler/enqueued': 7524,
 'scheduler/enqueued/memory': 7524,
 'start_time': datetime.datetime(2018, 3, 10, 17, 42, 25, 211537)}
2018-03-10 17:49:09 [scrapy.core.engine] INFO: Spider closed (finished)

Ironically, while typing up this update I was unable to produce a log of an unsuccessful run, because this time the spider scraped all 8,000+ pages without issue.

Could it be that the website is managing the traffic it receives, and that is why the spider's behavior is sporadic?
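
If it is rate limiting, one way to test the hypothesis would be to flag 200 responses whose bodies are suspiciously small, since a throttled or placeholder page should carry far less markup than a real content page. A rough sketch; the 2048-byte threshold is a guess I would calibrate against len(response.body) for a page that scraped correctly:

def parse_contents(self, response):
    # Flag "thin" 200s that are unlikely to be real content pages.
    # The byte threshold is an assumption -- calibrate per site.
    if response.status == 200 and len(response.body) < 2048:
        self.logger.warning('Thin 200 response (%d bytes): %s',
                            len(response.body), response.url)
    ...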

Any ideas on how to troubleshoot this would be greatly appreciated.

0 Answers