I am using Scrapy 1.4.0 and Python 3.6.3. When I run "scrapy crawl jobs -o items.csv", every other row of the CSV file is blank.
I found a solution here, but applying it produces the following error:
2017-12-31 18:39:48 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method FeedExporter.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x000001FEA491D128>>
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\extensions\feedexport.py", line 224, in item_scraped
    slot.exporter.export_item(item)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\exporters.py", line 243, in export_item
    self.csv_writer.writerow(values)
  File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 77-78: character maps to <undefined>
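From the traceback, there seem to be two separate problems: the blank rows come from Windows translating the \r\n that the csv module writes into \r\r\n, and the UnicodeEncodeError comes from the export stream falling back to the Windows default cp1252 codec instead of UTF-8. Below is an untested sketch of a custom exporter that tries to address both; it assumes Scrapy 1.4's CsvItemExporter internals (self.stream and self.csv_writer), and myproject.exporters is a placeholder module path:

# exporters.py (sketch, not verified)
import csv
import io

from scrapy.exporters import CsvItemExporter

class Utf8CsvItemExporter(CsvItemExporter):
    def __init__(self, file, **kwargs):
        # Force UTF-8 so characters outside cp1252 no longer raise
        # UnicodeEncodeError when rows are written.
        kwargs['encoding'] = 'utf-8'
        super().__init__(file, **kwargs)
        # Rebuild the text wrapper with newline='' so the \r\n written by
        # the csv module is not translated into \r\r\n (the blank rows).
        self.stream = io.TextIOWrapper(
            file,
            line_buffering=False,
            write_through=True,
            encoding='utf-8',
            newline='',
        )
        # Note: this sketch drops any csv dialect options passed in kwargs.
        self.csv_writer = csv.writer(self.stream)

It would then be registered in settings.py so it replaces the built-in CSV exporter:

# settings.py
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.Utf8CsvItemExporter',
}

If only the encoding error matters (not the blank rows), setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py may be enough on its own.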
Thanks in advance.

Here is the code:
import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/egr?s=120']

    def parse(self, response):
        # Each result row on the listing page.
        jobs = response.xpath('//p[@class="result-info"]')
        for job in jobs:
            title = job.xpath('a[@class="result-title hdrlnk"]/text()').extract_first()
            # Slice off the surrounding " (" and ")" around the neighborhood.
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first()
            yield Request(relative_url, callback=self.parse_page,
                          meta={'URL': relative_url, 'Title': title, 'Address': address})

    def parse_page(self, response):
        url = response.meta.get('URL')
        title = response.meta.get('Title')
        address = response.meta.get('Address')
        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()
        description = "".join(response.xpath('//*[@id="postingbody"]/text()').extract())
        yield {'URL': url, 'Title': title, 'Address': address,
               'compensation': compensation, 'employment_type': employment_type,
               'Description': description}
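To confirm that the blank rows come from the csv module's newline handling on Windows rather than from the spider itself, this standalone snippet (out.csv is an arbitrary file name) reproduces and then avoids the symptom:

import csv

rows = [['a', 'b'], ['c', 'd']]

# Opened without newline='': on Windows every '\r\n' the csv module writes
# is translated to '\r\r\n', which shows up as a blank row between records.
with open('out.csv', 'w') as f:
    csv.writer(f).writerows(rows)

# Opened with newline='' as the csv docs require: no blank rows.
with open('out.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)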