I'm using a Scrapy spider to crawl a list of URLs from a text file. I'm new to Python and Scrapy, and I just need to get this task done.
My URL list is fairly large, so this is how I implemented it:
from scrapy.spider import BaseSpider
from scrapy.http.request import Request
import time

class MySpider(BaseSpider):
    name = "example"
    allowed_domains = ["example.com"]

    def __init__(self, filename=None, delay=5, start_line=0):
        self.currentline = 0
        self.download_delay = int(delay)
        self.filename = filename
        self.start_line = int(start_line)

    def start_requests(self):
        with open(self.filename, 'r') as f:
            for url in f.readlines():
                self.currentline += 1
                if self.currentline < self.start_line:
                    continue
                else:
                    print(self.currentline)
                    yield Request(url.strip(), self.parse)

    def parse(self, response):
        logfilename = 'log'
        with open(logfilename, 'a') as f:
            f.write('Crawled line ' + str(self.currentline) + ' of ' + self.filename + ': ' + response.url + '\n')
I'm not parsing anything yet; I'll worry about that later. For now I just log each crawl.
I invoke it like this:
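The line-skipping logic above can also be written so that the file is read lazily rather than loaded whole with readlines(). A minimal sketch (hypothetical helper name `urls_from`, assuming one URL per line), not the spider's actual code:

```python
from itertools import islice

def urls_from(filename, start_line=0):
    """Yield (line_number, url) pairs, skipping lines before start_line."""
    with open(filename) as f:
        # enumerate from 1 so numbers match editor line numbers;
        # islice drops the first start_line - 1 pairs lazily
        skip = max(start_line - 1, 0)
        for lineno, url in islice(enumerate(f, start=1), skip, None):
            yield lineno, url.strip()
```

Iterating the file object directly means a very large URL list never has to sit in memory as one list.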
scrapy runspider myfolder\kwdSpider.py -a filename=myfolder\urls.txt -a delay=10 -a start_line=124
Because the URL list can be very large, I implemented the option to resume crawling from a given start_line (hence the yield Request()). It all actually works, except for this:
E:\Python27>scrapy runspider mysite\kwdSpider.py -a filename=example\urls.txt -a delay=8 -a start_line=124
E:\Python27\example\kwdSpider.py:5: ScrapyDeprecationWarning: kwdSpider.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class MySpider(BaseSpider):
2017-12-06 12:27:35+0100 [scrapy] INFO: Scrapy 0.25.1 started (bot: scrapybot)
2017-12-06 12:27:35+0100 [scrapy] INFO: Optional features available: ssl, http11
2017-12-06 12:27:35+0100 [scrapy] INFO: Overridden settings: {}
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-12-06 12:27:35+0100 [scrapy] INFO: Enabled item pipelines:
2017-12-06 12:27:35+0100 [example] INFO: Spider opened
2017-12-06 12:27:35+0100 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-06 12:27:35+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-12-06 12:27:35+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
2017-12-06 12:27:38+0100 [example] DEBUG: Crawled (200) <GET https://www.example.com/139th url> (referer: None)
140
2017-12-06 12:27:47+0100 [example] DEBUG: Crawled (200) <GET https://www.example.com/140th url> (referer: None)
141
See how it skips over the first dozen or so URLs? I clearly don't understand how the spider works: is it that nothing gets processed until the start_requests routine has counted that far through the txt file (that's the only way I can explain what I'm seeing)?
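The suspected behavior can be demonstrated without Scrapy at all. start_requests is a generator, and the scheduler pulls requests from it in a batch before any response is handled (by default CONCURRENT_REQUESTS is 16, which matches the 16 line numbers 124-139 printed above before the first "Crawled" line), so a counter shared on self runs ahead of the responses. A minimal sketch with hypothetical names:

```python
def start_requests(counter):
    # mimics the spider: bump a shared counter, then yield the request
    for url in ["url1", "url2", "url3"]:
        counter["line"] += 1
        yield url

counter = {"line": 0}
gen = start_requests(counter)

# the scheduler eagerly pulls a batch of requests before any response arrives
batch = [next(gen), next(gen), next(gen)]

# by the time the first "response" would be handled, the shared counter
# already reflects the last request pulled, not the first
assert counter["line"] == 3
```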
Bonus question - what is this notice about?
kwdSpider.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class MySpider(BaseSpider):
Thanks.

Answer 0 (score: 0):

I'll answer the bonus question while the main problem remains under investigation.
About the warning:
kwdSpider.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class MySpider(BaseSpider):
As the message suggests, change the code to:
from scrapy.spider import Spider
from scrapy.http.request import Request
import time

class MySpider(Spider):

That should at least take care of the warning.