CrawlSpider does not follow the defined Rules when used in a script

Date: 2015-10-14 14:05:11

Tags: python-2.7 web-scraping scrapy

This scraper works fine when I invoke it from the command line, e.g.:

scrapy crawl generic

This is what my scraper looks like:

import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+',)), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        pass  # extract some data and store it somewhere
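As an aside, the `allow=(r'.{22}.+')` pattern simply matches any URL at least 23 characters long. A quick standalone check with plain `re` (outside Scrapy, using the start URL that appears in the logs below):

```python
import re

# Same pattern as in the Rule: 22 arbitrary characters followed by
# at least one more, i.e. it matches URLs of length >= 23.
pattern = re.compile(r'.{22}.+')

print(bool(pattern.match('http://thevine.com.au/')))           # False (exactly 22 chars)
print(bool(pattern.match('http://thevine.com.au/category/')))  # True
```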

I am trying to use this spider from a Python script. I followed the documentation at http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

This is what the script looks like:

from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+',)), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        pass  # extract some data and store it somewhere

settings=Settings()
settings.set('DEPTH_LIMIT',1)

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()

This is what I see on the terminal when I run it from the script:

Desktop $ python newspider.py  
2015-10-14 21:46:39 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-14 21:46:39 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 21:46:39 [scrapy] INFO: Overridden settings: {'DEPTH_LIMIT': 1}
2015-10-14 21:46:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 21:46:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 21:46:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 21:46:39 [scrapy] INFO: Enabled item pipelines: 
2015-10-14 21:46:39 [scrapy] INFO: Spider opened
2015-10-14 21:46:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 21:46:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 21:46:39 [scrapy] DEBUG: Redirecting (302) to <GET http://thevine.com.au/> from <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'thevine.com.au': <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET http://www.twitter.com/thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?u=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/intent/tweet?text=Leonardo+DiCaprio+is+Producing+A+Movie+About+The+Volkswagen+Emissions+Scandal&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F&via=thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET http://plus.google.com/share?url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'pinterest.com': <GET http://pinterest.com/pin/create/button/?media=http%3A%2F%2Fs3-ap-southeast-2.amazonaws.com%2Fthevine-online%2Fwp-content%2Fuploads%2F2015%2F10%2F13202447%2FScreen-Shot-2015-10-14-at-7.24.25-AM.jpg&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] INFO: Closing spider (finished)
2015-10-14 21:46:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 28536,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 14, 16, 16, 41, 270707),
 'log_count/DEBUG': 10,
 'log_count/INFO': 7,
 'offsite/domains': 7,
 'offsite/filtered': 139,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 10, 14, 16, 16, 39, 454120)}

In this case, start_url is http://thevine.com.au/ and allowed_domains is thevine.com.au.
Running the same spider as part of a Scrapy project, with the same start URL and domain, gives this:

$ scrapy crawl generic -a start="http://thevine.com.au/" -a domains="thevine.com.au"
2015-10-14 22:14:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: mary)
2015-10-14 22:14:45 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 22:14:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mary.spiders', 'SPIDER_MODULES': ['mary.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'mary'}
2015-10-14 22:14:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 22:14:46 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 22:14:46 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 22:14:46 [scrapy] INFO: Enabled item pipelines:
2015-10-14 22:14:46 [scrapy] INFO: Spider opened
2015-10-14 22:14:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 22:14:46 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 22:14:47 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 22:14:47 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
.
.
2015-10-14 22:14:48 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/category/entertainment/> (referer: http://thevine.com.au/)

2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/ 
2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/viral/
.
.

2015-10-14 22:16:10 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/gear/tech/elon-musk-plans-to-launch-4000-satellites-to-bring-wi-fi-to-most-remote-locations-on-earth/> (referer: http://thevine.com.au/)  
2015-10-14 22:19:31 [scrapy] INFO: Crawled 26 pages (at 16 pages/min), scraped 0 items (at 0 items/min)

and so on; it keeps going.

So basically, my understanding of what happens when running from the script is that the Rule is not followed at all. My parse_item callback never runs; no callback other than the default parse works. The spider only crawls the URLs in start_urls and only invokes the default parse method, if one is defined.

1 answer:

Answer 0 (score: 2)

You need to pass an instance of the Spider class to the `.crawl` method:

    ...
    spider = MySpider()
    process.crawl(spider)
    ...

But it will still work either way. The log shows you are making offsite requests; try removing `allowed_domains` from the Spider (if you don't care about it), but you can also pass the domains on `process.crawl`, whose keyword arguments reach the spider constructor just like `-a` on the command line:

    process.crawl(MySpider, domains="thevine.com.au")