CrawlSpider does not follow the defined Rules when used in a script

Date: 2015-10-14 14:05:11

Tags: python-2.7 web-scraping scrapy

This scraper works fine when I invoke it from the command line, e.g.:

scrapy crawl generic

This is what my scraper looks like:

import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+',)), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        pass  # extract some data and store it somewhere
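As an aside, the `allow=(r'.{22}.+')` pattern simply matches any URL at least 23 characters long. A quick standalone check with plain `re` (outside Scrapy, using the start URL that appears in the logs below):

```python
import re

# Same pattern as in the Rule: 22 arbitrary characters followed by
# at least one more, i.e. it matches URLs of length >= 23.
pattern = re.compile(r'.{22}.+')

print(bool(pattern.match('http://thevine.com.au/')))           # False (exactly 22 chars)
print(bool(pattern.match('http://thevine.com.au/category/')))  # True
```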

I am trying to use this spider from a Python script. I followed the documentation at http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

This is what the script looks like:

from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+',)), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        pass  # extract some data and store it somewhere

settings=Settings()
settings.set('DEPTH_LIMIT',1)

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()

This is what I see on the terminal when I run it from the script:

Desktop $ python newspider.py  
2015-10-14 21:46:39 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-14 21:46:39 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 21:46:39 [scrapy] INFO: Overridden settings: {'DEPTH_LIMIT': 1}
2015-10-14 21:46:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 21:46:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 21:46:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 21:46:39 [scrapy] INFO: Enabled item pipelines: 
2015-10-14 21:46:39 [scrapy] INFO: Spider opened
2015-10-14 21:46:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 21:46:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 21:46:39 [scrapy] DEBUG: Redirecting (302) to <GET http://thevine.com.au/> from <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'thevine.com.au': <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET http://www.twitter.com/thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?u=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/intent/tweet?text=Leonardo+DiCaprio+is+Producing+A+Movie+About+The+Volkswagen+Emissions+Scandal&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F&via=thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET http://plus.google.com/share?url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'pinterest.com': <GET http://pinterest.com/pin/create/button/?media=http%3A%2F%2Fs3-ap-southeast-2.amazonaws.com%2Fthevine-online%2Fwp-content%2Fuploads%2F2015%2F10%2F13202447%2FScreen-Shot-2015-10-14-at-7.24.25-AM.jpg&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] INFO: Closing spider (finished)
2015-10-14 21:46:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 28536,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 14, 16, 16, 41, 270707),
 'log_count/DEBUG': 10,
 'log_count/INFO': 7,
 'offsite/domains': 7,
 'offsite/filtered': 139,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 10, 14, 16, 16, 39, 454120)}

In this case, start_url is http://thevine.com.au/ and allowed_domains is thevine.com.au.
Running the same spider as part of a Scrapy project, with the same start URL and domain, gives this:

$ scrapy crawl generic -a start="http://thevine.com.au/" -a domains="thevine.com.au"
2015-10-14 22:14:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: mary)
2015-10-14 22:14:45 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 22:14:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mary.spiders', 'SPIDER_MODULES': ['mary.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'mary'}
2015-10-14 22:14:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 22:14:46 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 22:14:46 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 22:14:46 [scrapy] INFO: Enabled item pipelines:
2015-10-14 22:14:46 [scrapy] INFO: Spider opened
2015-10-14 22:14:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 22:14:46 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 22:14:47 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 22:14:47 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
.
.
2015-10-14 22:14:48 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/category/entertainment/> (referer: http://thevine.com.au/)

2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/ 
2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/viral/
.
.

2015-10-14 22:16:10 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/gear/tech/elon-musk-plans-to-launch-4000-satellites-to-bring-wi-fi-to-most-remote-locations-on-earth/> (referer: http://thevine.com.au/)  
2015-10-14 22:19:31 [scrapy] INFO: Crawled 26 pages (at 16 pages/min), scraped 0 items (at 0 items/min)

and so on; it keeps going.

So basically, my understanding of what happens when running from the script is that the Rule is not followed at all. My parse_item callback never runs; no callback other than the default parse works. The spider only crawls the URLs in start_urls and only invokes the default parse method, if one is defined.

1 answer:

Answer 0 (score: 2)

You need to pass an instance of the Spider class to the `.crawl` method:

    ...
    spider = MySpider()
    process.crawl(spider)
    ...

But it will still work either way. The log shows you are making offsite requests; try removing `allowed_domains` from the Spider (if you don't care about it), but you can also pass the domains on `process.crawl`, whose keyword arguments reach the spider constructor just like `-a` on the command line:

    process.crawl(MySpider, domains="thevine.com.au")