Scrapy spider stops running after getting results from the first city in the list

Asked: 2019-03-16 02:34:14

Tags: python loops scrapy web-crawler

I built a scraper to run through a job site and save all potential job data to a csv file and then into my MySQL database. For some reason, the scraper stops running after pulling jobs from the first city in the list. Here's what I mean:

City list code:

Cities = {
    'cities':[  'washingtondc',
                'newyork',
                'sanfrancisco',
                '...',
                '...']
            }

Scrapy spider code:

# -*- coding: utf-8 -*-
from city_list import Cities
import scrapy, os, csv, glob, pymysql.cursors

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    c_list = Cities['cities']
    for c in c_list:
        print(f'Searching {c} for jobs...')
        allowed_domains = [f'{c}.jobsite.com']
        start_urls = [f'https://{c}.jobsite.com/search/jobs/']

        def parse(self, response):
            listings = response.xpath('//li[@class="listings-path"]')
            for listing in listings:
                date = listing.xpath('.//*[@class="date-path"]/@datetime').extract_first()
                link = listing.xpath('.//a[@class="link-path"]/@href').extract_first()
                text = listing.xpath('.//a[@class="text-path"]/text()').extract_first()

                yield scrapy.Request(link,
                                    callback=self.parse_listing,
                                    meta={'date': date,
                                        'link': link,
                                        'text': text})

            next_page_url = response.xpath('//a[text()="next-path "]/@href').extract_first()
            if next_page_url:
                yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

        def parse_listing(self, response):
            date = response.meta['date']
            link = response.meta['link']
            text = response.meta['text']
            compensation = response.xpath('//*[@class="compensation-path"]/span[1]/b/text()').extract_first()
            employment_type = response.xpath('//*[@class="employment-type-path"]/span[2]/b/text()').extract_first()
            images = response.xpath('//*[@id="images-path"]//@src').extract()
            address = response.xpath('//*[@id="address-path"]/text()').extract()

            yield {'date': date,
                'link': link,
                'text': text,
                'compensation': compensation,
                'type': employment_type,
                'images': images,
                'address': address}

        def close(self, reason):
            csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)


            conn = pymysql.connect(host='localhost',
                                user='root',
                                password='**********',
                                db='jobs_database',
                                charset='utf8mb4',
                                cursorclass=pymysql.cursors.DictCursor)

            cur = conn.cursor()
            csv_data = csv.reader(open('jobs.csv'))

            for row in csv_data: 
                cur.execute('INSERT INTO jobs_table(date, link, text, compensation, type, images, address)' 'VALUES(%s, %s, %s, %s, %s, %s, %s)', row)

            conn.commit()
            conn.close()
            print("Done Importing!")

The scraper works fine, but it stops running after scraping jobs from washingtondc and then exits.

How can I fix this?

UPDATE - I changed the code above to the following:

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        super().__init__(self, *args, **kwargs)
        c_list = Cities['cities']
        for c in c_list:
            print(f'Searching {c} for jobs...')
            self.allowed_domains.append(f'{c}.jobsearch.com')
            self.start_urls.append(f'https://{c}.jobsearch.com/search/jobs/')


    def parse(self, response):
        ...

and am now getting a "RecursionError: maximum recursion depth exceeded while calling a Python object".

Here's the traceback:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 1034, in emit
    msg = self.format(record)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 880, in format
    return fmt.format(record)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 619, in format
    record.message = record.getMessage()
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 380, in getMessage
    msg = msg % self.args
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  [Previous line repeated 479 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object

1 Answer:

Answer 0 (score: 3)

The first problem is that your spider's variables and methods are inside the for loop. Instead, you need to set those member variables in __init__(). Without testing the rest of your logic, here's a rough idea of what you need to do:

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    # Don't do the for loop here.
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        c_list = Cities['cities']
        for c in c_list:
            self.allowed_domains.append(f'{c}.jobsite.com')
            self.start_urls.append(f'https://{c}.jobsite.com/search/jobs/')

    def parse(self, response):
        # ...
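
Incidentally, the super() call here also explains the RecursionError from your update: super().__init__(self, *args, **kwargs) passes self twice. Python already supplies self implicitly, so Scrapy's Spider.__init__() receives the spider instance as its name argument; Spider.__str__() then formats self.name with %r, which calls __str__() on the same object again and recurses, exactly as your traceback shows. Side by side:

# Broken (from the update): `self` is forwarded explicitly, so
# Spider.__init__() binds the spider instance to its `name` parameter,
# and every attempt to log or print the spider recurses forever.
super().__init__(self, *args, **kwargs)

# Fixed (as in the snippet above): Python passes `self` implicitly;
# forward only the remaining arguments.
super().__init__(*args, **kwargs)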

If you're still having problems after this, update your question and I'll try to update the answer.


To explain what was going wrong: when you have the for loop inside the class body like in your question, you end up overwriting the variables and functions. Here's an example straight from a Python shell:

>>> class SomeClass:
...     for i in range(3):
...         print(i)
...         value = i
...         def get_value(self):
...             print(self.value)
... 
0
1
2
>>> x = SomeClass()
>>> x.value
2
>>> x.get_value()
2

Basically, the for loop gets executed before the class is even used. So instead of making the function run multiple times, this redefines it multiple times, and the end result is that your functions and variables point at whatever was set last.
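
For what it's worth, a more idiomatic way to fan the crawl out over every city is to override start_requests() instead of mutating the class-level allowed_domains and start_urls lists in __init__(). A minimal sketch, assuming the same Cities dict and jobsite.com URL scheme from the question (parsing logic omitted):

# -*- coding: utf-8 -*-
from city_list import Cities
import scrapy

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    # One entry per city so the offsite middleware allows every subdomain.
    allowed_domains = [f'{c}.jobsite.com' for c in Cities['cities']]

    def start_requests(self):
        # Scrapy calls this once at startup; yielding requests here
        # replaces the need for a start_urls list entirely.
        for c in Cities['cities']:
            self.logger.info('Searching %s for jobs...', c)
            yield scrapy.Request(f'https://{c}.jobsite.com/search/jobs/',
                                 callback=self.parse)

    def parse(self, response):
        ...  # same listing/pagination logic as in the question

This keeps the class body free of loop side effects, so nothing gets silently redefined.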