I thought the correct implementation of the spider would be to override two of the built-in methods:
parse_start_url() and parse()
When I run the spider with the custom parse() override commented out, it runs fine: it aggregates links with SgmlLinkExtractor and everything works.
But as soon as I uncomment the custom parse(), the spider runs without errors yet produces no output, so the problem has to be in how requests and responses are handed between these methods.
I have actually spent hours trying to get this to work, overriding the methods in different ways, switching to InitSpider / BaseSpider structures, and so on. Nothing seems to set the cookie correctly.
My Scrapy version is 0.16.4, which is old, so perhaps the problem lies there?
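For reference, the kind of thing I was attempting looks roughly like this, sketched with a plain BaseSpider and illustrative names rather than my exact code; the cookies argument on Request is the documented per-request way to send cookies in this Scrapy version:

    # Sketch of the BaseSpider-style attempt (illustrative names, not the exact code)
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class CookieAttemptSpider(BaseSpider):
        name = "cookie_attempt"
        start_urls = ["http://some.domain.com.au/shows/genre.aspx?c=2048"]

        def start_requests(self):
            # attach the region cookie to the very first request
            for url in self.start_urls:
                yield Request(url, cookies={'somedomain.com.au+2': 'national'},
                              callback=self.parse, dont_filter=True)

        def parse(self, response):
            self.log("got %s" % response.url)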
*SOLVED* Never mind, I worked it out with a deep breath and a little luck. I revisited the "no middleware" approach with CrawlSpider and SgmlLinkExtractor(), and overrode make_requests_from_url() instead.
So I removed the block of code that was overriding parse() and added:
def make_requests_from_url(self, url):
    request = Request(url, cookies={'somedomain.com.au+2': 'national'}, dont_filter=True)
    return request
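This works because the default start_requests() simply calls make_requests_from_url() for every entry in start_urls, so the override attaches the cookie to the very first request without touching parse() or the CrawlSpider rules. Roughly, the stock behaviour being replaced is (paraphrased from the Scrapy source, so treat it as a sketch):

    # Roughly what the stock BaseSpider does, which is why overriding
    # make_requests_from_url() is enough to inject the cookie at the start.
    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)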
The spider:
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.contrib.spiders import Rule,CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.shell import inspect_response
from scrapy.http.cookies import CookieJar
from TM.items import TMItem
import json
import time
import datetime
import re
import sys
import os
COOKIES_DEBUG = True
COOKIES_ENABLED = True
SPIDER_NAME = "TKComAuSpider"
SPIDER_VERSION = "1.0"
class TKComAuSpider(CrawlSpider):
    name = "TKComAuMusicSpecific"
    allowed_domains = ["domain.com.au"]

    global response_urls
    response_urls = []
    global site_section_category
    global master_items
    master_items = []

    start_urls = ["http://some.domain.com.au/shows/genre.aspx?c=2048"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(".*page=[0-9]+.*",),
                               restrict_xpaths=('//*[@id="ctl00_uiBodyMain_searchResultsControl_uiPaginateBottom_List"]/ul/li',)),
             callback="parse_it", follow=True),
    )

    def parse(self, response):
        request_with_cookies = Request(url=self.start_urls[0],
                                       cookies={'domain.com.au+2': 'national'})
        print '\n\n' + request_with_cookies.url + '\n\n'
        yield request_with_cookies

    def parse_start_url(self, response):
        list(self.parse_it(response))

    def parse_it(self, response):
        spider_name = "TKComAuMusicSpecific"
        doc_date = datetime.datetime.now().strftime("%d-%m-%y-%H:%M")
        items = []
        hxs = HtmlXPathSelector(response)

        # RESPONSE ASSIGNMENT #
        response_url = response.url
        response_urls.append(response_url)

        # cl = response.headers.getlist('Cookie')
        # if cl:
        #     msg = "Sending cookies to: %s" % response_url + os.linesep
        #     msg += os.linesep.join("Cookie: %s" % c for c in cl)
        #     log.msg(msg, spider=spider, level=log.DEBUG)

        # CUSTOM SITE_SECTION TO CREATE SPIDER CAT FROM RESPONSE_URL #
        site_section_category = re.sub(r'^.*//[a-zA-Z0-9._-]+([^.?]+).*$', r'\1',
                                       response.url).title().replace('/', '')
        spider_category = "TKTerms" + site_section_category
        file_name = 'out/' + spider_category + ".out"

        with open("log/response.log", 'a') as l:
            l.write(doc_date + ' ' + ' spider: ' + spider_name + '\nresponse_url: ' + response_url
                    + '\nsite_section_category: ' + site_section_category
                    + '\nspider_category: ' + spider_category + '\n')

        f = open(file_name, 'w')
        for site in hxs.select('//*[@class="contentEvent"]'):
            link = site.select('h6/a/@href').extract()
            title = site.select('h6/a/text()').extract()
            f.write("%s\n" % title)
            master_items.append({"title": title[0], "item_type": spider_category})
            yield TMItem(title=title[0], item_type=spider_category)
        f.close()

        json_out = 'json/' + spider_name + '.json'
        f = open(json_out, 'w')
        final_json = (json.dumps({"docs": [{"spider_name": SPIDER_NAME, "spider_version": SPIDER_VERSION},
                                           {"doc_title": spider_name, "doc_date": doc_date,
                                            "urls": response_urls}, master_items]}))
        f.write(final_json)
        f.close()
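Running the spider with the parse() override still in place gives the log below; the bare URLs printed between the crawl lines come from the print statement in parse():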
2014-04-30 13:15:46+1000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-04-30 13:15:46+1000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-04-30 13:15:46+1000 [scrapy] DEBUG: Enabled item pipelines: JsonWriterPipelineLines, JsonWriterPipeline
2014-04-30 13:15:46+1000 [TKComAuMusicSpecific] INFO: Spider opened
2014-04-30 13:15:46+1000 [TKComAuMusicSpecific] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-04-30 13:15:46+1000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2014-04-30 13:15:46+1000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2014-04-30 13:15:46+1000 [TKComAuMusicSpecific] DEBUG: Redirecting (302) to <GET http://www.some.com.au/detection.aspx?rt=http%3a%2f%2fsome.domain.com.au%2fshows%2fgenre.aspx%3fc%3d2048> from <GET http://some.domain.com.au/shows/genre.aspx?c=2048>
2014-04-30 13:15:46+1000 [TKComAuMusicSpecific] DEBUG: Redirecting (302) to <GET http://some.domain.com.au/shows/genre.aspx?c=2048> from <GET http://www.some.com.au/detection.aspx?rt=http%3a%2f%2fsome.domain.com.au%2fshows%2fgenre.aspx%3fc%3d2048>
2014-04-30 13:15:46+1000 [TKComAuMusicSpecific] DEBUG: Crawled (200) <GET http://some.domain.com.au/shows/genre.aspx?c=2048> (referer: None)
http://some.domain.com.au/shows/genre.aspx?c=2048
2014-04-30 13:15:47+1000 [TKComAuMusicSpecific] DEBUG: Crawled (200) <GET http://some.domain.com.au/shows/genre.aspx?c=2048> (referer: http://some.domain.com.au/shows/genre.aspx?c=2048)
http://some.domain.com.au/shows/genre.aspx?c=2048
2014-04-30 13:15:47+1000 [TKComAuMusicSpecific] INFO: Closing spider (finished)
2014-04-30 13:15:47+1000 [TKComAuMusicSpecific] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1260,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 146364,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 4, 30, 3, 15, 47, 108720),
'log_count/DEBUG': 10,
'log_count/INFO': 4,
'request_depth_max': 2,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2014, 4, 30, 3, 15, 46, 220003)}
2014-04-30 13:15:47+1000 [TKComAuMusicSpecific] INFO: Spider closed (finished)
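Reading the log back: the only things fetched are the start URL (plus its detection-page redirects) and the duplicate of it yielded from parse(); no rule-extracted pagination links are followed and nothing is scraped. That matches the documented CrawlSpider behaviour of using its own parse() to route responses through the rules, so overriding parse() silently disables the link extraction. (As far as I can tell, the module-level COOKIES_DEBUG / COOKIES_ENABLED constants in the spider file also have no effect in 0.16; they need to live in settings.py.) Applied to this spider, the fix described above amounts to deleting the parse() override and adding the make_requests_from_url() override; a sketch, reusing the cookie name from the spider code:

    # Sketch of the fix applied to TKComAuSpider: drop the parse() override
    # entirely (CrawlSpider needs its own parse() to dispatch responses to the
    # rules) and attach the cookie where the start requests are built instead.
    def make_requests_from_url(self, url):
        return Request(url,
                       cookies={'domain.com.au+2': 'national'},
                       dont_filter=True)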