I'm trying to scrape information about each firm from this website (www.canadianlawlist.com).
I have most of it working, but I've run into a small problem.
I'm trying to display the results in the following order:
-Firm Name and Information
*Employees from the firm Information.
But the results I get are fairly random.
It scrapes the information for two firms, and only then scrapes the employee information, like this:
-Firm Name and Information
-Firm name and information
*Employee from Firm 1
-Firm name and information
*Employee from Firm 2
Something like that. I'm not sure what my code is missing:
def parse_after_submit(self, response):
    basicurl = "canadianlawlist.com/"
    products = response.xpath('//*[@class="searchresult_item_regular"]/a/@href').extract()
    # follow each firm's detail page
    for p in products:
        url = "http://canadianlawlist.com" + p
        yield scrapy.Request(url, callback=self.parse_firm_info)
    # process next page
    #for x in range(2, 6):
    #    next_page_url = "https://www.canadianlawlist.com/searchresult?searchtype=firms&city=montreal&page=" + str(x)

def parse_firm_info(self, response):
    name = response.xpath('//div[@class="listingdetail_companyname"]/h1/span/text()').extract_first()
    print name
    for info in response.xpath('//*[@class="listingdetail_contactinfo"]'):
        street_address = info.xpath('//div[@class="listingdetail_contactinfo"]/div[1]/span/div/text()').extract_first()
        city = info.xpath('//*[@itemprop="addressLocality"]/text()').extract_first()
        province = info.xpath('//*[@itemprop="addressRegion"]/text()').extract_first()
        postal_code = info.xpath('//*[@itemprop="postalCode"]/text()').extract_first()
        telephone = info.xpath('//*[@itemprop="telephone"]/text()').extract_first()
        fax_number = info.xpath('//*[@itemprop="faxNumber"]/text()').extract_first()
        email = info.xpath('//*[@itemprop="email"]/text()').extract_first()
        print street_address
        print city
        print province
        print postal_code
        print telephone
        print fax_number
        print email
    for people in response.xpath('//div[@id="main_block"]/div[1]/div[2]/div[2]'):
        pname = people.xpath('//*[@class="listingdetail_individual_item"]/h3/a/text()').extract()
        print pname
    basicurl = "canadianlawlist.com/"
    # follow each employee's profile page
    employees = response.xpath('//*[@class="listingdetail_individual_item"]/h3/a/@href').extract()
    for e in employees:
        url2 = "http://canadianlawlist.com" + e
        yield scrapy.Request(url2, callback=self.parse_employe_info)

def parse_employe_info(self, response):
    ename = response.xpath('//*[@class="listingdetail_individualname"]/h1/span/text()').extract_first()
    job_title = response.xpath('//*[@class="listingdetail_individualmaininfo"]/div/i/span/text()').extract_first()
    print ename
    print job_title
Answer 0 (score: 0)
When you are doing parallel programming, you cannot rely on the order of Python's print calls. If you care about the order of what reaches standard output, you need to use the logging module.
Scrapy provides a shortcut for this on the Spider class:
import scrapy
import logging

class MySpider(scrapy.Spider):
    name = "myspider"  # spiders need a name attribute

    def parse(self, response):
        self.log("first message", level=logging.INFO)
        self.log("second message", level=logging.INFO)
Answer 1 (score: 0)
Scrapy runs multiple requests concurrently, so whatever shows up on the console can belong to any of the requests that are in flight at that moment. You can go to settings.py and set
CONCURRENT_REQUESTS = 1
Now only one request is started at a time, so your console output will come out in a meaningful order, but this will also make the crawl slower.
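For reference, a minimal sketch of where that setting lives; the spider class and name below are placeholders, not part of the original code:

# settings.py (applies to the whole project)
CONCURRENT_REQUESTS = 1

# or limit only this spider via custom_settings:
import scrapy

class FirmsSpider(scrapy.Spider):
    name = "firms"  # placeholder name
    custom_settings = {
        "CONCURRENT_REQUESTS": 1,  # start only one request at a time
    }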