Scrapy: 0 pages crawled, but no visible issue?

Posted: 2018-07-24 04:32:35

Tags: web-scraping scrapy scrapy-spider scrapinghub portia

I created a spider with Portia and then downloaded it as a Scrapy project. The spider runs without errors, but the log says "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)", and nothing gets saved. At the same time, it shows every page being crawled with a 200 response, and the request/response byte counts in the stats at the end.

Spider code

from __future__ import absolute_import

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
from scrapy.spiders import Rule

from ..utils.spiders import BasePortiaSpider
from ..utils.starturls import FeedGenerator, FragmentGenerator
from ..utils.processors import Item, Field, Text, Number, Price, Date, Url, Image, Regex
from ..items import PortiaItem, AllProductsBooksToScrapeSandboxItem


class BooksToscrape(BasePortiaSpider):
    name = "books.toscrape.com"
    allowed_domains = ['books.toscrape.com']
    start_urls = [{'fragments': [{'valid': True,
                                  'type': 'fixed',
                                  'value': 'http://books.toscrape.com/catalogue/page-'},
                                 {'valid': True,
                                  'type': 'range',
                                  'value': '1-50'},
                                 {'valid': True,
                                  'type': 'fixed',
                                  'value': '.html'}],
                   'type': 'generated',
                   'url': 'http://books.toscrape.com/catalogue/page-[1-50].html'}]
    rules = [
        Rule(
            LinkExtractor(
                allow=(),
                deny=('.*')
            ),
            callback='parse_item',
            follow=True
        )
    ]
    items = [
        [
            Item(
                AllProductsBooksToScrapeSandboxItem,
                None,
                '.product_pod',
                [
                    Field('title', 'h3 > a::attr(title)', []),
                    Field('price', '.product_price > .price_color *::text', []),
                ],
            )
        ]
    ]
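
For reference, the 'generated' start_urls entry above is just a compact description of fifty catalogue URLs. A minimal plain-Python sketch of what Portia's FragmentGenerator presumably expands it to (an assumption for illustration, not the Portia source):

# Hypothetical equivalent of the generated URL pattern above,
# written as a plain list comprehension (no Portia helpers):
start_urls = ['http://books.toscrape.com/catalogue/page-%d.html' % n
              for n in range(1, 51)]  # the '1-50' range fragment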

Pipeline code

I added open_spider and close_spider methods to write the items to a JSON-lines file during the crawl, and I believe the pipeline is wired up correctly, because the .jl file does get created.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class TesterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
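
Incidentally, the same JSON-lines output can be produced without a custom pipeline by using Scrapy's built-in feed export from the command line, which makes a useful cross-check that the problem is with item extraction rather than with the pipeline:

scrapy crawl books.toscrape.com -o items.jl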

Settings code

The pipeline is also enabled in the settings, so it should be active.

# -*- coding: utf-8 -*-

# Scrapy settings for Tester project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tester'

SPIDER_MODULES = ['Tester.spiders']
NEWSPIDER_MODULE = 'Tester.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Tester (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tester.middlewares.TesterSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tester.middlewares.TesterDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tester.pipelines.TesterPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

When the spider is run, the following log is produced:

(scrape) C:\Users\da74\Desktop\tester>scrapy crawl books.toscrape.com
2018-07-24 12:18:15 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: Tester)
2018-07-24 12:18:15 [scrapy.utils.log] INFO: Versions: lxml 4.2.2.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-07-24 12:18:15 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Tester', 'NEWSPIDER_MODULE': 'Tester.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Tester.spiders']}
2018-07-24 12:18:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-07-24 12:18:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 12:18:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 12:18:16 [scrapy.middleware] INFO: Enabled item pipelines:
['Tester.pipelines.TesterPipeline']
2018-07-24 12:18:16 [scrapy.core.engine] INFO: Spider opened
2018-07-24 12:18:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 12:18:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-24 12:18:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2018-07-24 12:18:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-26.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-27.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-32.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-29.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-30.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-33.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-28.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-31.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-34.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-35.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-36.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-39.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-40.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-38.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-41.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-37.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-42.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-43.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-44.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-47.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-45.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-46.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-48.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-49.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-50.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 12:18:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 12168,
 'downloader/request_count': 51,
 'downloader/request_method_count/GET': 51,
 'downloader/response_bytes': 299913,
 'downloader/response_count': 51,
 'downloader/response_status_count/200': 50,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 24, 4, 18, 18, 598891),
 'log_count/DEBUG': 52,
 'log_count/INFO': 7,
 'response_received_count': 51,
 'scheduler/dequeued': 50,
 'scheduler/dequeued/memory': 50,
 'scheduler/enqueued': 50,
 'scheduler/enqueued/memory': 50,
 'start_time': datetime.datetime(2018, 7, 24, 4, 18, 16, 208142)}
2018-07-24 12:18:18 [scrapy.core.engine] INFO: Spider closed (finished)

I don't understand why it isn't collecting any items. The log first reports 0 items scraped, and then shows every page responding successfully with 200. If anyone has an idea what to try to get it scraping, please help. Thanks.
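
In case it helps anyone reproduce this: the CSS selectors from the items definition can be sanity-checked in the Scrapy shell (a minimal check, using the extract_first() API available in Scrapy 1.5):

scrapy shell "http://books.toscrape.com/catalogue/page-1.html"
>>> response.css('.product_pod h3 > a::attr(title)').extract_first()
>>> response.css('.product_pod .product_price > .price_color::text').extract_first()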

0 Answers
