Python / Scrapy: CrawlSpider stops after fetching start_urls

Time: 2017-03-09 17:14:46

Tags: python scrapy scrapy-spider

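I have wasted a lot of time wrapping my head around Scrapy, reading the docs and various Scrapy blogs and Q&As... and now I am about to do the thing people hate most: ask for directions ;-) The problem is this: my spider opens and fetches the start_urls, but apparently does nothing with them. Instead it closes immediately, and that is it. Apparently I never even reach the first self.log() statement.

What I have so far: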

# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *

class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
            # follow ST Regra links in the form of:
            # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
            # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
            # follow ST Thermo links in the form of:
            # https://www.kiweb.de/default.aspx?pageid=202&page=\d+ 
            # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )
    rules = (
        # First rule that matches a given link is followed / parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)


    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback = self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
#       return(items)
        yield items
#       what is the difference between return and yield?? found both on web.

Running scrapy crawl KiSpider produces:

2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider)
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider (info@defrent.de)', 'DOWNLOAD_DELAY': 0.25}
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None)
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 465,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 48998,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)}
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished)

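Shouldn't the login routine end not with a callback but with some kind of return/yield statement? Or what am I doing wrong? Unfortunately, the docs and tutorials I have seen so far give me only a vague idea of how each piece connects to the others; Scrapy's documentation in particular seems to be written for people who already know Scrapy.

Somewhat frustrated greetings, Christopher

1 Answer:

Answer 0: (score: 0)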


    rules = (
        # First rule that matches a given link is followed / parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                # allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

You don't need the allow parameter, because there is only one link in the tag selected by the XPath.

I don't understand the regex in the allow parameter, but at the very least you should escape the ? in it.
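
The escaping point can be checked with Python's re module alone, without running the spider. The following is a minimal sketch, not part of the original answer: REPAIRED is a hypothetical correction of the pattern from the question, the sample URL (including its capital "D") is an assumption about how the site writes its links, and matches() is a helper introduced only for this illustration.

```python
import re

# Pattern copied from the first Rule in the question.
ORIGINAL = r'Default\.aspx?pageid=(202|206])&page=\d+'

# Hypothetical repair (an assumption, not taken from the answer):
# the '?' is escaped and the stray ']' after 206 is dropped.
REPAIRED = r'Default\.aspx\?pageid=(202|206)&page=\d+'

# Assumed sample link; the real hrefs on the site may be cased differently.
url = 'https://www.kiweb.de/Default.aspx?pageid=206&page=2'
lower_url = url.replace('Default', 'default')


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# Unescaped, 'x?' means "an optional x", so the literal '?' in the URL is never matched.
print(matches(ORIGINAL, url))        # False
print(matches(REPAIRED, url))        # True

# The match is case-sensitive; the start_urls in the question use 'default.aspx'.
print(matches(REPAIRED, lower_url))  # False
print(re.search(REPAIRED, lower_url, re.IGNORECASE) is not None)  # True
```

If the links on the page turn out to be lowercase, either lowercase the pattern or compile it with re.IGNORECASE. Dropping allow entirely, as suggested above, sidesteps both problems.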