Why can't I scrape the URL? Keyword not defined

Time: 2019-06-18 15:11:45

Tags: python web-scraping scrapy

I am trying to crawl a site and yield the URL of every page I scrape. The logic is already in place, but it seems my code cannot find the link.

Here is my code:

import scrapy
from scrapy.linkextractors import LinkExtractor
import datetime
import uuid


class QuotesSpider(scrapy.Spider):

  name = 'CYR_FINAL'
  start_urls = ['https://www.companiesintheuk.co.uk/Company/Find?q=']

  def start_requests(self):
    # self points to the spider instance that was initialized by the
    # scrapy framework when starting a crawl. Spider instances are
    # "augmented" with crawl arguments available as instance attributes,
    # so self.ip holds the (string) value passed on the command line
    # with `-a ip=somevalue`.
    for url in self.start_urls:
      yield scrapy.Request(url + self.ip, dont_filter=True)

  def parse(self, response):

    # Extract each search-result item page and hand it to parse_details
    for company_url in response.xpath('//div[@class="search_result_title"]/a/@href').extract():
      yield scrapy.Request(
          url=response.urljoin(company_url),
          callback=self.parse_details,
      )

    # Extract subsections from the current section
    for section in response.xpath('//a[@id="sic-section-description"]/@href').extract():
      yield scrapy.Request(url=response.urljoin(section), callback=self.parse)

    # Follow the link to the next page of search results
    next_page_url = response.xpath(
        '//li/a[@class="pageNavNextLabel"]/@href').extract_first()
    if next_page_url:
      yield scrapy.Request(
          url=response.urljoin(next_page_url),
          callback=self.parse,
      )

  def parse_details(self, response):

    # Loop through the searchResult block and yield one item per match
    for i in response.css('#content2'):
      yield {
          'company_name': i.css('[itemprop="name"]::text').get(),
          'company_registration_no': i.css('#content2 > div:nth-child(6) > div:nth-child(2)::text').extract_first(),
          'address': i.css('[itemprop="streetAddress"]::text').extract_first(),
          'location': i.css("[itemprop='addressLocality']::text").extract_first(),
          'postal_code': i.css("[itemprop='postalCode']::text").extract_first(),
          'land_code': i.css("test").extract_first(default="GB"),
          'date_time': datetime.datetime.now(),
          'uid': str(uuid.uuid4()),
          'url': (self.url)  # <-- this is the line that raises the AttributeError below
      }
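
For reference, I start the spider with the `ip` crawl argument that the comments in start_requests describe, so the search query gets appended to the start URL. I assume the command looks something like this (with `a` as the query, matching the referer in the log below):

scrapy crawl CYR_FINAL -a ip=a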

The last line of my code is the problem. Why do I get an error when I try to grab the URL there? An undefined keyword?

I receive the error below, and as you can see it says it cannot find url. I cannot reach it through the self keyword. Does anyone know why?

2019-06-18 17:23:49 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.companiesintheuk.co.uk/ltd/a-d> (referer: https://www.companiesintheuk.co.uk/Company/Find?q=a)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Desktop/Scrapy projects/CompanyUK_V2/CompanyUK/spiders/CYR_FINAL.py", line 68, in parse_details
    'url': (self.url)
AttributeError: 'QuotesSpider' object has no attribute 'url'
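
Is the fix simply to take the page address from the response object instead of from the spider? A minimal sketch of what I mean (just a guess on my part, with the other fields left exactly as above):

  def parse_details(self, response):
    # response.url is the address of the page currently being parsed;
    # the spider instance itself never gets a `url` attribute, which is
    # why self.url raises the AttributeError above
    for i in response.css('#content2'):
      yield {
          'company_name': i.css('[itemprop="name"]::text').get(),
          # ... the other fields stay the same ...
          'url': response.url,
      }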
