I'm new to Scrapy and web scraping, so please bear with me. I want to scrape profilecanada.com. When I run the code below, no errors are raised, but I don't think it's actually scraping anything. In my code, I start on a page that contains a list of links. Each link leads to a page with yet another list of links, and each of those leads to the page holding the data I need to extract and save to a JSON file. In general it's something like "nested link scraping"; I don't know what this is actually called. Please see the image below for the spider's output when I ran it. Thanks in advance for your help.
import scrapy


class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['http://www.profilecanada.com']
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):
        # urls from start_url
        category_list_urls = response.css('div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()

        # start_u = 'http://www.profilecanada.com/browse_by_category.cfm/'

        # for each category of company
        for url in category_list_urls:
            url = url[3:]
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.profileCategoryPages)

    def profileCategoryPages(self, response):
        company_list_url = response.css('div.dv_en_block_name_frame > a::attr(href)').extract()

        # for each company in the list
        for url in company_list_url:
            url = response.urljoin(url)
            return scrapy.Request(url=url, callback=self.companyDetails)

    def companyDetails(self, response):
        return {
            'company_name': response.css('span#name_frame::text').extract_first(),
            'street_address': str(response.css('span#frame_addr::text').extract_first()),
            'city': str(response.css('span#frame_city::text').extract_first()),
            'region_or_province': str(response.css('span#frame_province::text').extract_first()),
            'postal_code': str(response.css('span#frame_postal::text').extract_first()),
            'country': str(response.css('div.type6_GM > div > div::text')[-1].extract())[2:],
            'phone_number': str(response.css('span#frame_phone::text').extract_first()),
            'fax_number': str(response.css('span#frame_fax::text').extract_first()),
            'email': str(response.css('span#frame_email::text').extract_first()),
            'website': str(response.css('span#frame_website > a::attr(href)').extract_first()),
        }
[Image: the result in cmd when I ran the spider]
Answer (score: 1)
You should change `allowed_domains` to `allowed_domains = ['profilecanada.com']` and change every `return scrapy.Request` to `yield scrapy.Request`, and then it will start working. Also keep in mind that obeying robots.txt is not always enough; you should throttle your requests if necessary.
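To see why the `return` is the core bug, here is a minimal sketch in plain Python (no Scrapy needed; `follow_with_return` and `follow_with_yield` are hypothetical names standing in for the spider's callbacks): a `return` inside a `for` loop exits the function on the first iteration, so only one request would ever be scheduled, while `yield` turns the callback into a generator that emits one request per link, which is exactly what Scrapy expects.

```python
def follow_with_return(urls):
    # return exits the function immediately: only the first URL is used
    for url in urls:
        return f"Request({url})"

def follow_with_yield(urls):
    # yield emits one value per iteration: every URL is used
    for url in urls:
        yield f"Request({url})"

urls = ["/page1", "/page2", "/page3"]
print(follow_with_return(urls))       # Request(/page1)
print(list(follow_with_yield(urls)))  # ['Request(/page1)', 'Request(/page2)', 'Request(/page3)']
```

The same logic applies to the spider: with `yield scrapy.Request(...)`, `parse` schedules a request for every category link instead of stopping after the first one.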