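I have a Scrapy spider that used to work, but now it dies after a single request, and I can't figure out what is going on. The full output is below, followed by my spider code.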
jeff@deltaskelta:~/Desktop/hangulscrape/hangulscrape$ scrapy crawl englishwiki -o test.json
2015-01-13 22:20:41+0900 [scrapy] INFO: Scrapy 0.24.4 started (bot: hangulscrape)
2015-01-13 22:20:41+0900 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2015-01-13 22:20:41+0900 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'hangulscrape.spiders', 'FEED_URI': 'test.json', 'SPIDER_MODULES': ['hangulscrape.spiders'], 'BOT_NAME': 'hangulscrape', 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeue.FifoMemoryQueue', 'DEPTH_PRIORITY': 1, 'FEED_FORMAT': 'json', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeue.PickleFifoDiskQueue'}
2015-01-13 22:20:42+0900 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-13 22:20:43+0900 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-13 22:20:43+0900 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-13 22:20:43+0900 [scrapy] INFO: Enabled item pipelines:
2015-01-13 22:20:43+0900 [englishwiki] INFO: Spider opened
2015-01-13 22:20:43+0900 [englishwiki] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-13 22:20:43+0900 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6036
2015-01-13 22:20:43+0900 [scrapy] DEBUG: Web service listening on 127.0.0.1:6093
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Garden_warbler> (referer: None)
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'en.wikipedia.org': <GET http://en.wikipedia.org/wiki/Garden_warbler>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.mediawiki.org': <GET https://www.mediawiki.org/wiki/Special:MyLanguage/Extension:TimedMediaHandler/Client_download>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.iucnredlist.org': <GET http://www.iucnredlist.org/details/22716906>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'dx.doi.org': <GET http://dx.doi.org/10.1111%2Fj.1463-6409.2006.00221.x>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.ncbi.nlm.nih.gov': <GET http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1794596>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.birdlife.org': <GET http://www.birdlife.org/datazone/speciesfactsheet.php?id=8074>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.jstor.org': <GET http://www.jstor.org/stable/4454>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'blx1.bto.org': <GET http://blx1.bto.org/birdfacts/results/bob12760.htm>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.webcitation.org': <GET http://www.webcitation.org/6HLrPClx6>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.euring.org': <GET http://www.euring.org/data_and_codes/longevity-voous.htm>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.tandfonline.com': <GET http://www.tandfonline.com/doi/pdf/10.1080/00063657909476637>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.nhm.ac.uk': <GET http://www.nhm.ac.uk/research-curation/scientific-resources/biodiversity/uk-biodiversity/british-flea-distribution/database/Searchpage.do?county=&fleaname=&host=&hostname=Garden+Warbler&listoption=&publication=&search=Search&sortorder=&species=>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.biodiversitylibrary.org': <GET http://www.biodiversitylibrary.org/item/88617>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'oops.uni-oldenburg.de': <GET http://oops.uni-oldenburg.de/214/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'commons.wikimedia.org': <GET http://commons.wikimedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ibc.lynxeds.com': <GET http://ibc.lynxeds.com/species/garden-warbler-sylvia-borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.ornithos.de': <GET http://www.ornithos.de/Ornithos/Feather_Collection/Sylvia_borin/Sylvia_borin.htm>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'donate.wikimedia.org': <GET https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?uselang=en&utm_campaign=C13_en.wikipedia.org&utm_medium=sidebar&utm_source=donate>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'shop.wikimedia.org': <GET http://shop.wikimedia.org/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.wikidata.org': <GET http://www.wikidata.org/wiki/Q202478>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'kbd.wikipedia.org': <GET http://kbd.wikipedia.org/wiki/%D0%92%D1%8D%D0%B4%D0%B3%D1%8A%D1%83%D0%B0%D0%B1%D0%B6%D1%8D>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'af.wikipedia.org': <GET http://af.wikipedia.org/wiki/Tuinsanger>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ar.wikipedia.org': <GET http://ar.wikipedia.org/wiki/%D8%AF%D8%AE%D9%84%D8%A9_%D8%A7%D9%84%D8%A8%D8%B3%D8%A7%D8%AA%D9%8A%D9%86>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ba.wikipedia.org': <GET http://ba.wikipedia.org/wiki/%D2%BA%D0%B0%D2%99_%D0%BA%D0%B8%D0%BB%D0%B5%D0%B9%D0%B5%D0%B3%D0%B5>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'bg.wikipedia.org': <GET http://bg.wikipedia.org/wiki/%D0%93%D1%80%D0%B0%D0%B4%D0%B8%D0%BD%D1%81%D0%BA%D0%BE_%D0%BA%D0%BE%D0%BF%D1%80%D0%B8%D0%B2%D0%B0%D1%80%D1%87%D0%B5>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'br.wikipedia.org': <GET http://br.wikipedia.org/wiki/Devedig-liorzh>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ca.wikipedia.org': <GET http://ca.wikipedia.org/wiki/Tallarol_gros>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ceb.wikipedia.org': <GET http://ceb.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'cs.wikipedia.org': <GET http://cs.wikipedia.org/wiki/P%C4%9Bnice_slav%C3%ADkov%C3%A1>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'cy.wikipedia.org': <GET http://cy.wikipedia.org/wiki/Telor_yr_Ardd>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'da.wikipedia.org': <GET http://da.wikipedia.org/wiki/Havesanger>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'de.wikipedia.org': <GET http://de.wikipedia.org/wiki/Gartengrasm%C3%BCcke>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'et.wikipedia.org': <GET http://et.wikipedia.org/wiki/Aed-p%C3%B5%C3%B5salind>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'es.wikipedia.org': <GET http://es.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'eo.wikipedia.org': <GET http://eo.wikipedia.org/wiki/%C4%9Cardensilvio>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'eu.wikipedia.org': <GET http://eu.wikipedia.org/wiki/Baso-txinbo>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fa.wikipedia.org': <GET http://fa.wikipedia.org/wiki/%D8%A2%D9%84%D9%88%DA%86%D9%87%E2%80%8C%D8%AE%D9%88%D8%B1%DA%A9>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fo.wikipedia.org': <GET http://fo.wikipedia.org/wiki/Gar%C3%B0lj%C3%B3mari>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fr.wikipedia.org': <GET http://fr.wikipedia.org/wiki/Fauvette_des_jardins>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'gl.wikipedia.org': <GET http://gl.wikipedia.org/wiki/Papuxa_picafollas>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'hy.wikipedia.org': <GET http://hy.wikipedia.org/wiki/%D4%B1%D5%B5%D5%A3%D5%B8%D6%82_%D5%B7%D5%A1%D5%B0%D6%80%D5%AB%D5%AF>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'io.wikipedia.org': <GET http://io.wikipedia.org/wiki/Bekafiko>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'it.wikipedia.org': <GET http://it.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'he.wikipedia.org': <GET http://he.wikipedia.org/wiki/%D7%A1%D7%91%D7%9B%D7%99_%D7%90%D7%A4%D7%95%D7%A8>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'kk.wikipedia.org': <GET http://kk.wikipedia.org/wiki/%D0%91%D0%B0%D2%9B_%D1%81%D0%B0%D0%BD%D0%B4%D1%83%D2%93%D0%B0%D1%88>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'kv.wikipedia.org': <GET http://kv.wikipedia.org/wiki/%D0%A1%D1%8D%D1%82%D3%A7%D1%80_%D0%BA%D0%B0%D0%B9>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'lt.wikipedia.org': <GET http://lt.wikipedia.org/wiki/Sodin%C4%97_devynbals%C4%97>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'li.wikipedia.org': <GET http://li.wikipedia.org/wiki/Zengersj_van_de_Ouwe_Waereld>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'hu.wikipedia.org': <GET http://hu.wikipedia.org/wiki/Kerti_posz%C3%A1ta>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'mk.wikipedia.org': <GET http://mk.wikipedia.org/wiki/%D0%93%D1%80%D0%B0%D0%B4%D0%B8%D0%BD%D1%81%D0%BA%D0%BE_%D0%B3%D1%80%D0%BC%D1%83%D1%88%D0%B0%D1%80%D1%87%D0%B5>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ms.wikipedia.org': <GET http://ms.wikipedia.org/wiki/Burung_siul_taman>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'nl.wikipedia.org': <GET http://nl.wikipedia.org/wiki/Tuinfluiter>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'nap.wikipedia.org': <GET http://nap.wikipedia.org/wiki/Fucetula>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'no.wikipedia.org': <GET http://no.wikipedia.org/wiki/Hagesanger>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'nn.wikipedia.org': <GET http://nn.wikipedia.org/wiki/Hagesongar>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ps.wikipedia.org': <GET http://ps.wikipedia.org/wiki/%D9%BC%D8%B1%D8%A7%DA%A9>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'pms.wikipedia.org': <GET http://pms.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'pl.wikipedia.org': <GET http://pl.wikipedia.org/wiki/Gaj%C3%B3wka_(ptak)>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'pt.wikipedia.org': <GET http://pt.wikipedia.org/wiki/Felosa-das-figueiras>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'ru.wikipedia.org': <GET http://ru.wikipedia.org/wiki/%D0%A1%D0%B0%D0%B4%D0%BE%D0%B2%D0%B0%D1%8F_%D1%81%D0%BB%D0%B0%D0%B2%D0%BA%D0%B0>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'sl.wikipedia.org': <GET http://sl.wikipedia.org/wiki/Vrtna_penica>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'fi.wikipedia.org': <GET http://fi.wikipedia.org/wiki/Lehtokerttu>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'sv.wikipedia.org': <GET http://sv.wikipedia.org/wiki/Tr%C3%A4dg%C3%A5rdss%C3%A5ngare>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'tr.wikipedia.org': <GET http://tr.wikipedia.org/wiki/Boz_%C3%B6tle%C4%9Fen>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'uk.wikipedia.org': <GET http://uk.wikipedia.org/wiki/%D0%9A%D1%80%D0%BE%D0%BF%D0%B8%D0%B2'%D1%8F%D0%BD%D0%BA%D0%B0_%D1%81%D0%B0%D0%B4%D0%BE%D0%B2%D0%B0>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'vi.wikipedia.org': <GET http://vi.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'war.wikipedia.org': <GET http://war.wikipedia.org/wiki/Sylvia_borin>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'creativecommons.org': <GET http://creativecommons.org/licenses/by-sa/3.0/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'wikimediafoundation.org': <GET http://wikimediafoundation.org/wiki/Terms_of_Use>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'www.wikimediafoundation.org': <GET http://www.wikimediafoundation.org/>
2015-01-13 22:20:47+0900 [englishwiki] DEBUG: Filtered offsite request to 'en.m.wikipedia.org': <GET http://en.m.wikipedia.org/w/index.php?mobileaction=toggle_view_mobile&title=Garden_warbler>
2015-01-13 22:20:47+0900 [englishwiki] INFO: Closing spider (finished)
2015-01-13 22:20:47+0900 [englishwiki] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 36442,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 13, 13, 20, 47, 424202),
'log_count/DEBUG': 74,
'log_count/INFO': 7,
'offsite/domains': 71,
'offsite/filtered': 289,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 1, 13, 13, 20, 43, 114492)}
2015-01-13 22:20:47+0900 [englishwiki] INFO: Spider closed (finished)
Here is the spider code:
import scrapy
from hangulscrape.items import HangulScrapeItem
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import json

class HangulSpider(CrawlSpider):
    name = 'englishwiki'
    allowed_domains = ['en.wikipedia.org/wiki/']
    start_urls = [
        'http://en.wikipedia.org/wiki/Garden_warbler'
    ]
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_it', follow=True),
    )

    def parse_it(self, response):
        the_item = HangulScrapeItem()
        response.body.decode('utf-8')
        body = response.xpath('//*[@id="mw-content-text"]//text()').extract()
        english_dict = {}
        for i in body:
            english_words = re.findall('[a-zA-Z\'-]+', i)
            if english_words:
                for j in english_words:
                    if len(j) > 1:
                        word = j.lower()
                        if word in english_dict:
                            english_dict[word] += 1
                        else:
                            english_dict[word] = 1
        jsondump = json.dumps(english_dict)
        the_item['word'] = jsondump
        the_item['site'] = response.url
        return the_item
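Answer 0 (score: 1)
Because the items module is missing, I couldn't reproduce all of your code. In any case, here is a simplified version of it, without the item and the JSON conversion: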
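import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class testSpider(CrawlSpider):
    name = 'englishwiki'
    allowed_domains = ["en.wikipedia.org/wiki/Garden_warbler"]
    start_urls = ["http://en.wikipedia.org/wiki/Garden_warbler"]
    rules = (Rule(SgmlLinkExtractor(), callback='parse', follow=True),)

    # Note: CrawlSpider uses parse() internally to apply its rules, so
    # overriding it here means only the start URL is processed.
    def parse(self, response):
        response.body.decode('utf-8')
        body = response.xpath('//*[@id="mw-content-text"]//text()').extract()
        # Count how often each word of two or more characters appears.
        english_dict = {}
        for i in body:
            english_words = re.findall('[a-zA-Z\'-]+', i)
            if english_words:
                for j in english_words:
                    if len(j) > 1:
                        word = j.lower()
                        if word in english_dict:
                            english_dict[word] += 1
                        else:
                            english_dict[word] = 1
        print english_dict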
Output:
1, u'times': 1, u'length': 2, u'south': 5, u'upperparts': 4, u'isbn': 19, u'evans': 1, u'scene': 1, u'reaches': 1, u'svalbard': 1, u'management': 1, u'atricapilla': 1, u'their': 15, u'vocalisation': 1, u'intermediate': 1, u'zoologica': 1, u'shell': 1, u'accompany': 1, u'july': 1, u'ben': 2, u'borini': 1, u'protista': 1, u'sweden': 2, u'...': 15, u'clip': 2, u'...': 17, u'...': 1, u'able': 1, u'relatives': 1, u'which': 13, u'vegetation': 2, u'digestive': 1, u'sylviae': 1, u'alarmed': 1, u'class': 1, u'afresh': 1, u'conspecifics': 2, u"dohrn's": 1, u'...': 1, u'...': 1, u'jean-louis': 1, u'sylviid': 1, u'painting': 1, u'phenology': 2, u'warblers': 31, u'selection': 1, u'biebach': 2, u'text': 1, u'...': 1, u'nagy': 1, u'longevity': 1, u'fear': 1, u'pause': 1, u'interspecific': 3, u'should': 1, u'jan': 1, u'bernard': 1, u'arabian': 1, u'piano': 1, u'local': 2, u'means': 2, u'borin': 16, u'areas': 6, u'organ': 2, u'...': 1, u'nightingale': 1, u'...': 1, u'mid-september': 1, u'edition': 1, u'boddaert': 7, u'oldenburg': 2, u'placed': 2, u'pattern': 1, u'southward': 2, u'identification': 2, u'closed': 4, u'bedfordshire': 1, u'simms': 3, u'kidneys': 1, u'publishers': 1, u'animalia': 1, u'miroslav': 1, u'jon': 2, u'seventeen': 1, u'olga': 1, u'april': 4, u'sexes': 1, u'passing': 1, u'grounds': 3, u'ch': 1, u'cm': 6, u'...': 1, u'coexistence': 1, u'...': 1, u'...': 1, u'...': 1, u'anthelme': 1, u'table': 1, u'second': 1, u'silvia': 1, u'quia': 1, u'long-tailed': 2
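Answer 1 (score: 0)
I think that if you remove:
allowed_domains = ['en.wikipedia.org/wiki/']
you will allow the spider to load domains other than en.wikipedia.org/wiki.
This log message shows the domain filtering in action:
DEBUG: Filtered offsite request to 'dx.doi.org':
It is filtering offsite requests, i.e. not letting the spider crawl them.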
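The filtering can be pictured with a small stand-alone sketch. This is not Scrapy's actual code: it assumes the offsite check compares only the hostname of each request against the entries in allowed_domains, the helper name is_onsite is hypothetical, and it is written for Python 3 with the standard library only.

```python
import re
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's hostname equals, or is a subdomain of, an allowed entry.

    Hypothetical helper: a simplified stand-in for an offsite filter,
    not Scrapy's real implementation.
    """
    if not allowed_domains:
        # No restriction configured: every request counts as on-site.
        return True
    # One alternation of all allowed entries, optionally preceded by a subdomain.
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    host = urlparse(url).hostname or ''
    return re.search(pattern, host) is not None


url = 'http://en.wikipedia.org/wiki/Garden_warbler'

# An entry that contains a path can never equal a bare hostname.
print(is_onsite(url, ['en.wikipedia.org/wiki/']))  # False
# A plain domain matches the same URL.
print(is_onsite(url, ['en.wikipedia.org']))  # True
# Other hosts are still rejected.
print(is_onsite('http://dx.doi.org/10.1111', ['en.wikipedia.org']))  # False
```

Under this assumption, the first call mirrors the log line in the question where even http://en.wikipedia.org/wiki/Garden_warbler itself was filtered as offsite.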
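Answer 2 (score: 0)
I found my problem. I was trying to specify a subdirectory of the domain in allowed_domains. Domains and subdomains seem to be fine, but subdirectories are not.
So (en.wikipedia.org = good) and (en.wikipedia.org/wiki = bad).
The right place to restrict which links are followed is the rule that extracts the links.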
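As a rough illustration of that split, the sketch below keeps the domain check and the path restriction separate. In a Scrapy spider the path pattern would typically be passed to the link extractor's allow argument inside the Rule, while allowed_domains would hold only 'en.wikipedia.org'. The function should_follow, the /wiki/ pattern, and the sample links are assumptions made for the example, and the code is plain Python 3 rather than a Scrapy spider.

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAIN = 'en.wikipedia.org'     # domain only, no path
ALLOW_PATTERN = re.compile(r'^/wiki/')  # assumed path restriction


def should_follow(url):
    """Hypothetical check: the host must be on the allowed domain and the path must match."""
    parts = urlparse(url)
    host = parts.hostname or ''
    on_domain = host == ALLOWED_DOMAIN or host.endswith('.' + ALLOWED_DOMAIN)
    return on_domain and ALLOW_PATTERN.search(parts.path) is not None


# Illustrative links, not taken from a real crawl.
links = [
    'http://en.wikipedia.org/wiki/Garden_warbler',
    'http://en.wikipedia.org/w/index.php?title=Garden_warbler',
    'http://de.wikipedia.org/wiki/Gartengrasm%C3%BCcke',
]

followed = [u for u in links if should_follow(u)]
print(followed)
```

Only the first link passes both checks: the second is on the right host but outside /wiki/, and the third has a /wiki/ path but a different host.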