Removing blank lines from CSV output on Windows 10 without errors

Time: 2018-01-01 00:03:51

Tags: python csv web-scraping scrapy

I am using Scrapy 1.4.0 and Python 3.6.3. When I run `scrapy crawl jobs -o items.csv`, every other line of the CSV file is blank.
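For context, the every-other-line-blank symptom is the classic Windows CSV newline-translation issue, independent of Scrapy itself. A minimal stdlib sketch (with an assumed sample row) reproduces it:

```python
import csv
import io

# Mimic a Windows text-mode file: every "\n" written gets expanded to "\r\n".
windows_style = io.StringIO(newline='\r\n')
csv.writer(windows_style).writerow(['title', 'address'])
# The csv module already terminates rows with "\r\n"; the translation turns
# that into "\r\r\n", which spreadsheet/text viewers render as a blank line.
print(repr(windows_style.getvalue()))  # 'title,address\r\r\n'

# Opening the output with newline='' disables the translation and keeps the
# row terminator intact -- this is the usual fix for the blank-line symptom.
fixed = io.StringIO(newline='')
csv.writer(fixed).writerow(['title', 'address'])
print(repr(fixed.getvalue()))  # 'title,address\r\n'
```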

I found a solution here, but it produces the following error:

2017-12-31 18:39:48 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method FeedExporter.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x000001FEA491D128>>
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "c:\programdata\anaconda3\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\extensions\feedexport.py", line 224, in item_scraped
    slot.exporter.export_item(item)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\exporters.py", line 243, in export_item
    self.csv_writer.writerow(values)
  File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 77-78: character maps to <undefined>
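The traceback shows the feed exporter encoding items with Windows' default cp1252 codec, which cannot represent some characters in the scraped text. A commonly suggested fix, assuming Scrapy's documented `FEED_EXPORT_ENCODING` setting applies here (it is available in Scrapy 1.4), is to force UTF-8 in the project's settings.py:

```python
# settings.py -- make the feed exporter encode items as UTF-8 instead of the
# platform default (cp1252 on Windows), which avoids the UnicodeEncodeError.
FEED_EXPORT_ENCODING = 'utf-8'
```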

Thanks in advance.

Here is the code:

import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/egr?s=120']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            title = job.xpath('a[@class="result-title hdrlnk"]/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first()
            yield Request(relative_url, callback=self.parse_page, meta={'URL': relative_url, 'Title': title, 'Address': address})

    def parse_page(self, response):
        url = response.meta.get('URL')
        title = response.meta.get('Title')
        address = response.meta.get('Address')
        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()

        description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract())
        yield {'URL': url, 'Title': title, 'Address': address, 'compensation': compensation, 'employment_type': employment_type, 'Description': description}
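A note on the `address` line above: the `result-hood` span's text on Craigslist listings typically looks like `" (Brooklyn)"` (the sample value here is assumed), so the `[2:-1]` slice strips the leading space plus opening parenthesis and the trailing parenthesis:

```python
# Hypothetical sample of the text node inside Craigslist's result-hood span.
raw_hood = " (Brooklyn)"
# Same slice as the spider: drop ' (' at the front and ')' at the end.
address = raw_hood[2:-1]
print(address)  # Brooklyn
```

The `extract_first("")` default also means that a listing with no neighborhood yields `""`, and `""[2:-1]` is still `""` rather than a `TypeError`.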

0 Answers:

No answers