I am trying to crawl a site and yield the URL of every page I scrape. The logic is already in place, but it's as if my code can't find the link.
Here is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
import datetime
import uuid


class QuotesSpider(scrapy.Spider):
    name = 'CYR_FINAL'
    start_urls = ['https://www.companiesintheuk.co.uk/Company/Find?q=']

    def start_requests(self):
        # self points to the spider instance
        # that was initialized by the scrapy framework when starting a crawl.
        # Spider instances are "augmented" with crawl arguments
        # available as instance attributes:
        # self.ip holds the (string) value passed on the command line
        # with `-a ip=somevalue`.
        for url in self.start_urls:
            yield scrapy.Request(url + self.ip, dont_filter=True)

    def parse(self, response):
        # Extract each search-result item page and send it to parse_details
        for company_url in response.xpath('//div[@class="search_result_title"]/a/@href').extract():
            yield scrapy.Request(
                url=response.urljoin(company_url),
                callback=self.parse_details,
            )
        # Extract subsections from the current section
        for section in response.xpath('//a[@id="sic-section-description"]/@href').extract():
            yield scrapy.Request(url=response.urljoin(section), callback=self.parse)
        # Follow the next search-result page
        next_page_url = response.xpath(
            '//li/a[@class="pageNavNextLabel"]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(
                url=response.urljoin(next_page_url),
                callback=self.parse,
            )

    def parse_details(self, response):
        # Loop through the searchResult block and yield each item
        for i in response.css('#content2'):
            yield {
                'company_name': i.css('[itemprop="name"]::text').get(),
                'company_registration_no': i.css('#content2 > div:nth-child(6) > div:nth-child(2)::text').extract_first(),
                'address': i.css('[itemprop="streetAddress"]::text').extract_first(),
                'location': i.css("[itemprop='addressLocality']::text").extract_first(),
                'postal_code': i.css("[itemprop='postalCode']::text").extract_first(),
                'land_code': i.css("test").extract_first(default="GB"),
                'date_time': datetime.datetime.now(),
                'uid': str(uuid.uuid4()),
                'url': (self.url)  # <- this is the line that fails
            }
The last line of the item dict is the problem. Why do I get an error when I try to grab the URL there? Is the attribute undefined?
I get the error below; as you can see, it says it cannot find the URL. I can't reach it through the self keyword. Does anyone know why?
2019-06-18 17:23:49 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.companiesintheuk.co.uk/ltd/a-d> (referer: https://www.companiesintheuk.co.uk/Company/Find?q=a)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Desktop/Scrapy projects/CompanyUK_V2/CompanyUK/spiders/CYR_FINAL.py", line 68, in parse_details
'url': (self.url)
AttributeError: 'QuotesSpider' object has no attribute 'url'
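To illustrate what I understand so far, here is a minimal stand-in sketch (the classes are hypothetical stand-ins, not real Scrapy classes): attributes passed with `-a` land on the spider instance (e.g. `self.ip`), while each response carries its own `url` attribute; the spider itself never gets a `self.url`.

```python
class FakeResponse:
    """Stand-in for scrapy.http.Response: every response carries its own URL."""
    def __init__(self, url):
        self.url = url


class FakeSpider:
    """Stand-in for scrapy.Spider: only attributes set via -a (or in the
    class body) exist on self; there is no implicit self.url."""
    def __init__(self, **crawl_args):
        for name, value in crawl_args.items():
            setattr(self, name, value)  # e.g. -a ip=a  ->  self.ip == "a"

    def parse_details(self, response):
        # The URL of the page being parsed lives on the response object.
        return {"url": response.url}


spider = FakeSpider(ip="a")
item = spider.parse_details(FakeResponse("https://www.companiesintheuk.co.uk/ltd/a-d"))
print(item["url"])             # prints the page URL
print(hasattr(spider, "url"))  # False -> self.url raises AttributeError
```

This matches the traceback above: `self` only has `ip` (from `-a ip=a`), so looking up `self.url` fails with AttributeError.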