I have this scraper working fine when I invoke it from the command line, like so:
scrapy crawl generic
This is what my scraper looks like:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (
        Rule(LinkExtractor(allow=(r'.{22}.+',)), callback='parse_item', follow=True),
    )
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        pass  # extract some data and store it somewhere
I am trying to use this spider from a python script. I followed the documentation at http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script and this is what the script looks like:
from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (
        Rule(LinkExtractor(allow=(r'.{22}.+',)), callback='parse_item', follow=True),
    )
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        pass  # extract some data and store it somewhere

settings = Settings()
settings.set('DEPTH_LIMIT', 1)

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()
This is what I see on the terminal when I run the script:

Desktop $ python newspider.py
2015-10-14 21:46:39 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-14 21:46:39 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 21:46:39 [scrapy] INFO: Overridden settings: {'DEPTH_LIMIT': 1}
2015-10-14 21:46:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 21:46:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 21:46:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 21:46:39 [scrapy] INFO: Enabled item pipelines:
2015-10-14 21:46:39 [scrapy] INFO: Spider opened
2015-10-14 21:46:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 21:46:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 21:46:39 [scrapy] DEBUG: Redirecting (302) to <GET http://thevine.com.au/> from <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'thevine.com.au': <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET http://www.twitter.com/thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?u=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/intent/tweet?text=Leonardo+DiCaprio+is+Producing+A+Movie+About+The+Volkswagen+Emissions+Scandal&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F&via=thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET http://plus.google.com/share?url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'pinterest.com': <GET http://pinterest.com/pin/create/button/?media=http%3A%2F%2Fs3-ap-southeast-2.amazonaws.com%2Fthevine-online%2Fwp-content%2Fuploads%2F2015%2F10%2F13202447%2FScreen-Shot-2015-10-14-at-7.24.25-AM.jpg&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] INFO: Closing spider (finished)
2015-10-14 21:46:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 28536,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 14, 16, 16, 41, 270707),
'log_count/DEBUG': 10,
'log_count/INFO': 7,
'offsite/domains': 7,
'offsite/filtered': 139,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 10, 14, 16, 16, 39, 454120)}
In this case the start_url is http://thevine.com.au/ and allowed_domains is thevine.com.au.
When the same start URL and domain are given to the spider run as part of a scrapy project, it gives this:
$ scrapy crawl generic -a start="http://thevine.com.au/" -a domains="thevine.com.au"
2015-10-14 22:14:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: mary)
2015-10-14 22:14:45 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 22:14:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mary.spiders', 'SPIDER_MODULES': ['mary.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'mary'}
2015-10-14 22:14:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 22:14:46 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 22:14:46 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 22:14:46 [scrapy] INFO: Enabled item pipelines:
2015-10-14 22:14:46 [scrapy] INFO: Spider opened
2015-10-14 22:14:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 22:14:46 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 22:14:47 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 22:14:47 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
.
.
2015-10-14 22:14:48 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/category/entertainment/> (referer: http://thevine.com.au/)
2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/
2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/viral/
.
.
2015-10-14 22:16:10 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/gear/tech/elon-musk-plans-to-launch-4000-satellites-to-bring-wi-fi-to-most-remote-locations-on-earth/> (referer: http://thevine.com.au/)
2015-10-14 22:19:31 [scrapy] INFO: Crawled 26 pages (at 16 pages/min), scraped 0 items (at 0 items/min)
and so on; it keeps going.
So basically, this is my understanding of what happens when running from the script: the Rule is not followed at all. My parse_item callback never fires; no callback other than the default parse works. The spider only crawls the URLs in start_urls and only calls back to the default parse method, if one is included.
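For reference, the project version of this spider takes the start URL and domain as -a arguments; below is a minimal sketch of how such an __init__ can look (the start and domains parameter names come from the command above, but the body here is illustrative, not my exact project code):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (
        Rule(LinkExtractor(allow=(r'.{22}.+',)), callback='parse_item', follow=True),
    )

    def __init__(self, start=None, domains=None, *args, **kwargs):
        # -a arguments arrive as strings; turn them into the list
        # attributes Scrapy expects before CrawlSpider compiles its rules.
        self.start_urls = [start] if start else []
        self.allowed_domains = [domains] if domains else []
        super(MySpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        pass  # extract some data and store it somewhere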
Answer 0 (score: 2)
You need to pass an instance of the Spider class to the .crawl method:

...
spider = MySpider()
process.crawl(spider)
...

The logs show that you are making offsite requests; try removing allowed_domains from the Spider (if you don't care about it) and it should still work correctly. You can also pass the domains on process.crawl.
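Putting it together, a minimal sketch of the adjusted script, assuming the MySpider class from the question; extra keyword arguments to process.crawl are forwarded to the spider constructor, where Scrapy turns them into instance attributes (the same mechanism behind -a on the command line):

from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess

settings = Settings()
settings.set('DEPTH_LIMIT', 1)

process = CrawlerProcess(settings)
# Override the class-level attributes per crawl instead of hardcoding them:
process.crawl(
    MySpider,
    start_urls=["http://thevine.com.au/"],
    allowed_domains=["thevine.com.au"],
)
process.start()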