I built a scraper to run across an entire jobs site, save all the potential job data to a CSV file, and then load it into my MySQL database. For some reason the scraper stops running after it pulls the jobs from the first city on the list. Here's what I mean:
Cities = {
    'cities': ['washingtondc',
               'newyork',
               'sanfrancisco',
               '...',
               '...']
}
# -*- coding: utf-8 -*-
from city_list import Cities
import scrapy, os, csv, glob, pymysql.cursors


class JobsSpider(scrapy.Spider):
    name = 'jobs'

    c_list = Cities['cities']
    for c in c_list:
        print(f'Searching {c} for jobs...')
        allowed_domains = [f'{c}.jobsite.com']
        start_urls = [f'https://{c}.jobsite.com/search/jobs/']

        def parse(self, response):
            listings = response.xpath('//li[@class="listings-path"]')
            for listing in listings:
                date = listing.xpath('.//*[@class="date-path"]/@datetime').extract_first()
                link = listing.xpath('.//a[@class="link-path"]/@href').extract_first()
                text = listing.xpath('.//a[@class="text-path"]/text()').extract_first()
                yield scrapy.Request(link,
                                     callback=self.parse_listing,
                                     meta={'date': date,
                                           'link': link,
                                           'text': text})
            next_page_url = response.xpath('//a[text()="next-path "]/@href').extract_first()
            if next_page_url:
                yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

        def parse_listing(self, response):
            date = response.meta['date']
            link = response.meta['link']
            text = response.meta['text']
            compensation = response.xpath('//*[@class="compensation-path"]/span[1]/b/text()').extract_first()
            employment_type = response.xpath('//*[@class="employment-type-path"]/span[2]/b/text()').extract_first()
            images = response.xpath('//*[@id="images-path"]//@src').extract()
            address = response.xpath('//*[@id="address-path"]/text()').extract()
            yield {'date': date,
                   'link': link,
                   'text': text,
                   'compensation': compensation,
                   'type': employment_type,
                   'images': images,
                   'address': address}

        def close(self, reason):
            csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)
            conn = pymysql.connect(host='localhost',
                                   user='root',
                                   password='**********',
                                   db='jobs_database',
                                   charset='utf8mb4',
                                   cursorclass=pymysql.cursors.DictCursor)
            cur = conn.cursor()
            csv_data = csv.reader(open('jobs.csv'))
            for row in csv_data:
                cur.execute('INSERT INTO jobs_table(date, link, text, compensation, type, images, address) '
                            'VALUES(%s, %s, %s, %s, %s, %s, %s)', row)
            conn.commit()
            conn.close()
            print("Done Importing!")
The scraper works fine, but it stops and exits after scraping the jobs for Washington, D.C.
How can I fix this?
UPDATE - I changed the code above to:
class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        super().__init__(self, *args, **kwargs)
        c_list = Cities['cities']
        for c in c_list:
            print(f'Searching {c} for jobs...')
            self.allowed_domains.append(f'{c}.jobsearch.com')
            self.start_urls.append(f'https://{c}.jobsearch.com/search/jobs/')

    def parse(self, response):
        ...
and now I'm getting "RecursionError: maximum recursion depth exceeded while calling a Python object".
Here's the traceback:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 1034, in emit
    msg = self.format(record)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 880, in format
    return fmt.format(record)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 619, in format
    record.message = record.getMessage()
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 380, in getMessage
    msg = msg % self.args
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  [Previous line repeated 479 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object
Answer (score: 3):
The first problem is that your spider variables and methods are inside the for loop. Instead, you need to set those member variables in __init__(). Without testing the rest of your logic, here is a rough idea of what you need to do:
class JobsSpider(scrapy.Spider):
    name = 'jobs'

    # Don't do the for loop here.
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        c_list = Cities['cities']
        for c in c_list:
            self.allowed_domains.append(f'{c}.jobsite.com')
            self.start_urls.append(f'https://{c}.jobsite.com/search/jobs/')

    def parse(self, response):
        # ...
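One more thing worth flagging, judging from the repeated Spider.__str__ frames in your traceback: the RecursionError in your update is almost certainly caused by the extra self in your super() call. scrapy.Spider.__init__ takes name as its first parameter, so passing self explicitly binds the spider instance to self.name, and logging then recurses forever when __str__ tries to format self.name (which is the spider itself):

# From the update in the question -- `self` is passed twice, so
# Spider.__init__ receives the spider instance as its `name` argument:
super().__init__(self, *args, **kwargs)

# Correct -- super() already supplies self:
super().__init__(*args, **kwargs)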
If you still have problems after this, update your question and I'll try to update the answer.
To explain what is going wrong: when you have a for loop in the class body like in your question, you end up overwriting the variables and functions. Here's a demonstration straight from the Python shell:
>>> class SomeClass:
...     for i in range(3):
...         print(i)
...         value = i
...         def get_value(self):
...             print(self.value)
...
0
1
2
>>> x = SomeClass()
>>> x.value
2
>>> x.get_value()
2
Basically, the for loop runs when the class body is executed, before the class is ever used. So rather than making the function run multiple times, it redefines the function multiple times. The end result is that your functions and variables point to whatever was set on the last iteration.
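As a side note, a common alternative that avoids mutating allowed_domains and start_urls in __init__() altogether is to override start_requests(). This is only a minimal sketch, assuming the same Cities dict and URL scheme as in the question:

from city_list import Cities
import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    # Built once at class-definition time, one entry per city.
    allowed_domains = [f'{c}.jobsite.com' for c in Cities['cities']]

    def start_requests(self):
        # Scrapy calls this once at startup and schedules every yielded
        # request, so all cities get crawled rather than only the last one.
        for c in Cities['cities']:
            yield scrapy.Request(f'https://{c}.jobsite.com/search/jobs/',
                                 callback=self.parse)

    def parse(self, response):
        ...  # same parsing logic as in the question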