I am trying to parse a file like this one, but for many longitudes and latitudes. The spider crawls through all of the webpages, but does not output anything.
Here is my code:
import scrapy
import json
from tutorial.items import DmozItem
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["proadvisorservice.intuit.com"]

    min_lat = 35
    max_lat = 40
    min_long = -100
    max_long = -90

    def start_requests(self):
        for i in range(self.min_lat, self.max_lat):
            for j in range(self.min_long, self.max_long):
                yield scrapy.Request('http://proadvisorservice.intuit.com/v1/search?latitude=%d&longitude=%d&radius=100&pageNumber=1&pageSize=&sortBy=distance' % (i, j),
                                     meta={'index': (i, j)},
                                     callback=self.parse)

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        for x in jsonresponse['searchResults']:
            item = DmozItem()
            item['firstName'] = x['firstName']
            item['lastName'] = x['lastName']
            item['phoneNumber'] = x['phoneNumber']
            item['email'] = x['email']
            item['companyName'] = x['companyName']
            item['qbo'] = x['qbopapCertVersions']
            item['qbd'] = x['papCertVersions']
            yield item
Answer 0 (score: 1)
When using CrawlSpider, you should not override the parse() method:

    When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work. (source)

However, since you are customizing the spider manually rather than using any of the CrawlSpider functionality, I would recommend not inheriting from it at all. Instead, inherit from scrapy.Spider:
class DmozSpider(scrapy.Spider):
    ...
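To sanity-check the latitude/longitude grid independently of Scrapy, the URL generation in start_requests can be pulled out into a plain function. This is a sketch mirroring the question's code (same endpoint and query string, including its empty pageSize parameter); grid_urls is a hypothetical helper name, not part of the original spider.

```python
def grid_urls(min_lat, max_lat, min_long, max_long):
    # Build one search URL per (latitude, longitude) pair on the
    # integer grid, exactly as start_requests does in the spider.
    base = ('http://proadvisorservice.intuit.com/v1/search'
            '?latitude=%d&longitude=%d&radius=100'
            '&pageNumber=1&pageSize=&sortBy=distance')
    return [base % (lat, lng)
            for lat in range(min_lat, max_lat)
            for lng in range(min_long, max_long)]

urls = grid_urls(35, 40, -100, -90)
print(len(urls))   # 5 latitudes x 10 longitudes = 50 requests
```

Running this locally confirms the loop bounds are exclusive on the upper end (range stops before max_lat/max_long), which is easy to overlook when the spider silently yields fewer requests than expected.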