I'm using Scrapy to get the content from certain URLs on a page, similar to this question: Use scrapy to get list of urls, and then scrape content inside those urls
I'm able to get the sub-URLs from the starting URL (the first def), but the second def doesn't seem to be reached; the resulting file is empty. I've tested the body of that function in the scrapy shell and it pulls the information I want, but not when I run the spider.
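For reference, this is roughly what I ran in the shell (the dealer URL below is just a placeholder for one of the real sub-URLs):

scrapy shell 'https://wheelsonline.ca/<some-dealer-page>'
>>> response.css('.dealer_head_main_name::text').extract_first()
>>> response.css('.dealer_head_phone::text').extract_first()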
import scrapy
from scrapy.selector import Selector
#from scrapy import Spider
from WheelsOnlineScrapper.items import Dealer
from WheelsOnlineScrapper.url_list import urls
import logging
from urlparse import urljoin

logger = logging.getLogger(__name__)

class WheelsonlinespiderSpider(scrapy.Spider):
    logger.info('Spider starting')
    name = 'wheelsonlinespider'
    rotate_user_agent = True # lives in middleware.py and settings.py
    allowed_domains = ["https://wheelsonline.ca"]
    start_urls = urls # this list is created in url_list.py
    logger.info('URLs retrieved')

    def parse(self, response):
        subURLs = []
        partialURLs = response.css('.directory_name::attr(href)').extract()
        for i in partialURLs:
            subURLs = urljoin('https://wheelsonline.ca/', i)
            yield scrapy.Request(subURLs, callback=self.parse_dealers)
            logger.info('Dealer ' + subURLs + ' fetched')

    def parse_dealers(self, response):
        logger.info('Beginning of page')
        dlr = Dealer()

        # Extracting the content using css selectors
        try:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
        except TypeError:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
        dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
        dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()

        logger.info('Dealer fetched ' + dlr['DealerName'])
        yield dlr

        logger.info('End of page')
Answer 0 (score: 0)
Your allowed_domains list contains the protocol (https). Per the documentation, it should hold only the domain name:
allowed_domains = ["wheelsonline.ca"]
Also, you should be getting a message about this in your log:

URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains.
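With no valid entry left in allowed_domains, Scrapy's offsite middleware filters every request yielded from parse (logged as "Filtered offsite request" at DEBUG level), which is why parse_dealers never runs and the output file stays empty.

For reference, a minimal sketch of the corrected spider; the selectors, the Dealer item, and the urls list come from the question, and response.urljoin (a standard Response helper that resolves relative hrefs against the page URL) replaces the hand-rolled urljoin call:

import logging

import scrapy

from WheelsOnlineScrapper.items import Dealer
from WheelsOnlineScrapper.url_list import urls

logger = logging.getLogger(__name__)

class WheelsonlinespiderSpider(scrapy.Spider):
    name = 'wheelsonlinespider'
    rotate_user_agent = True               # custom flag handled in middleware.py/settings.py
    allowed_domains = ["wheelsonline.ca"]  # domain only, no scheme
    start_urls = urls                      # built in url_list.py

    def parse(self, response):
        # Resolve each relative href against the current page URL and
        # queue the dealer page for parse_dealers.
        for href in response.css('.directory_name::attr(href)').extract():
            url = response.urljoin(href)
            logger.info('Queueing dealer page %s', url)
            yield scrapy.Request(url, callback=self.parse_dealers)

    def parse_dealers(self, response):
        dlr = Dealer()
        # Join the main and auxiliary names, skipping whichever is missing,
        # instead of catching the TypeError raised by concatenating None.
        main = response.css('.dealer_head_main_name::text').extract_first()
        aux = response.css('.dealer_head_aux_name::text').extract_first()
        dlr['DealerName'] = ' '.join(part for part in (main, aux) if part)
        dlr['MailingAddress'] = ','.join(response.css('.dealer_address_right::text').extract())
        dlr['PhoneNumber'] = response.css('.dealer_head_phone::text').extract_first()
        logger.info('Dealer fetched %s', dlr['DealerName'])
        yield dlr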