Scrapy Spider: crawl a list of URLs starting from line n

Date: 2017-12-06 11:39:09

Tags: python scrapy-spider

I'm using a Scrapy spider to crawl a list of URLs read from a text file. I'm new to Python and Scrapy, and I just need to get this working to finish the task.

My URL list is quite large, so this is how I implemented it:

from scrapy.spider import BaseSpider
from scrapy.http.request import Request


class MySpider(BaseSpider):
    name = "example"
    allowed_domains = ["example.com"]

    def __init__(self, filename=None, delay=5, start_line=0):
        self.currentline = 0
        self.download_delay = int(delay)  # seconds between requests
        self.filename = filename
        self.start_line = int(start_line)

    def start_requests(self):
        with open(self.filename, 'r') as f:
            for url in f:
                self.currentline += 1
                if self.currentline < self.start_line:
                    continue
                print(self.currentline)
                yield Request(url.strip(), self.parse)

    def parse(self, response):
        # No real parsing yet; just log which line was crawled.
        with open('log', 'a') as f:
            f.write('Crawled line ' + str(self.currentline) + ' of '
                    + self.filename + ': ' + response.url + '\n')
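As a side note, the skip-ahead counter in `start_requests` can be written more compactly with `itertools.islice`. This is only an equivalent sketch of that one loop (the `urls_from` helper is mine, not part of the spider):

```python
# Sketch: skip the first (start_line - 1) lines with islice instead of a
# manual counter; enumerate keeps the 1-based line number for logging.
from itertools import islice

def urls_from(filename, start_line=1):
    with open(filename) as f:
        remaining = islice(f, start_line - 1, None)
        for lineno, url in enumerate(remaining, start=start_line):
            yield lineno, url.strip()
```

With `start_line=124` this yields line 124 first, matching the `if self.currentline < self.start_line: continue` check above.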

I'm not parsing anything yet; I'll worry about that later. For now I just log each crawl.

I invoke the spider like this:

scrapy runspider myfolder\kwdSpider.py -a filename=myfolder\urls.txt -a delay=10 -a start_line=124

Because the URL list can get very large, I implemented the option to restart the crawl from a given start_line (and used yield Request() for that). It all actually works, except for this:

E:\Python27>scrapy runspider mysite\kwdSpider.py -a filename=example\urls.txt -a delay=8 -a start_line=124
E:\Python27\example\kwdSpider.py:5: ScrapyDeprecationWarning: kwdSpider.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
  class MySpider(BaseSpider):
2017-12-06 12:27:35+0100 [scrapy] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2017-12-06 12:27:35+0100 [scrapy] INFO: Optional features available: ssl, http11
2017-12-06 12:27:35+0100 [scrapy] INFO: Overridden settings: {}
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled item pipelines:
2017-12-06 12:27:35+0100 [example] INFO: Spider opened
2017-12-06 12:27:35+0100 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-06 12:27:35+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-12-06 12:27:35+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
2017-12-06 12:27:38+0100 [example] DEBUG: Crawled (200) <GET https://www.example.com/139th url> (referer: None)
140
2017-12-06 12:27:47+0100 [example] DEBUG: Crawled (200) <GET https://www.example.com/140th url> (referer: None)
141

See how the first dozen or so URLs appear to be skipped? I clearly don't properly understand how the spider works: is the spider not initialized until the start_requests routine has finished counting through the txt file (which is the only way I can imagine it being implemented)?
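(An observation on the output above: Scrapy consumes `start_requests` lazily as a generator, and by default keeps up to `CONCURRENT_REQUESTS = 16` requests in flight, so a batch of `print()` calls can fire before the first response ever comes back. The URLs are queued, not skipped. The plain-Python sketch below only imitates that pull pattern; none of these names are Scrapy internals.)

```python
# Plain-Python sketch (not Scrapy internals): start_requests is a
# generator, and the engine pulls requests out of it before the first
# response arrives -- by default up to CONCURRENT_REQUESTS = 16 at once.

def start_requests(start_line=124):
    for lineno in range(start_line, start_line + 50):
        print(lineno)                      # fires when the request is pulled
        yield 'request-for-line-%d' % lineno

CONCURRENT_REQUESTS = 16                   # Scrapy's default

gen = start_requests()
# The engine first fills its download slots...
in_flight = [next(gen) for _ in range(CONCURRENT_REQUESTS)]
# ...which is why 124..139 are printed before any page is downloaded.
```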

Bonus question: what is this warning?

@kwdSpider.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
  class MySpider(BaseSpider):

Thanks.

1 Answer:

Answer 0 (score: 0)

I'll answer the bonus question while the main problem is still under investigation.

Regarding the warning:

@kwdSpider.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
 class MySpider(BaseSpider):

As the message says, change the code to:

from scrapy.spider import Spider
from scrapy.http.request import Request

class MySpider(Spider):

That should at least take care of the warning.
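While you're touching `__init__`, note that your override never calls the parent's `__init__`, so any setup the base spider class would normally do is skipped. A plain-Python illustration of the pattern (`Base` here is a stand-in, not the real Spider class):

```python
# Stand-in classes showing why an overriding __init__ should call super():
class Base(object):                        # stand-in for Scrapy's Spider
    def __init__(self, name=None, **kwargs):
        self.name = name

class Broken(Base):
    def __init__(self, filename=None):
        self.filename = filename           # Base.__init__ never runs

class Fixed(Base):
    def __init__(self, filename=None, **kwargs):
        super(Fixed, self).__init__(**kwargs)  # parent setup still happens
        self.filename = filename
```

So in your spider, accepting `**kwargs` and forwarding them with `super(MySpider, self).__init__(**kwargs)` is the safer pattern.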