Python 3 script stopped working after Scrapy update

Date: 2018-11-08 10:24:06

Tags: python-3.x encoding scrapy twisted

As a Python novice, I am using Homebrew Python 3.7.2, Scrapy 1.5.1 and Twisted 18.9.0 on macOS 10.14.2, with the following script to download issues of an old newspaper archived on a website:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# A scrapy script to download issues of the Gaceta (1843-1961)

import errno
import json
import os
from datetime import datetime

import scrapy
from scrapy import FormRequest, Request

os.chdir("/Volumes/backup/Archives/Gaceta_Nicaragua_1843-1961") # directory path
print(os.getcwd())

# date range, format DD/MM/YYYY
start = '01/01/1843' # 01/01/1843
end = '31/12/1860' # archive runs to 31/12/1961

date_format = '%d/%m/%Y'
start = datetime.strptime(start, date_format)
end = datetime.strptime(end, date_format)

class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'
    allowed_domains = ['asamblea.gob.ni']
    start_urls = ['http://digesto.asamblea.gob.ni/consultas/coleccion/']

    papers = {
        "Diario Oficial": "28",
    }

    def parse(self, response):

        for key, value in self.papers.items():
            yield FormRequest(
                url='http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php',
                headers={'X-Requested-With': 'XMLHttpRequest'},
                formdata={
                    'hddQueryType': 'initgetRdds',
                    'cole': value,
                },
                meta={'paper': key},
                callback=self.parse_rdds,
            )

    def parse_rdds(self, response):
        data = json.loads(response.body_as_unicode())
        for r in data["rdds"]:
            if not r['fecPublica']:
                continue

            r_date = datetime.strptime(r['fecPublica'], date_format)

            if start <= r_date <= end:
                r['paper'] = response.meta['paper']
                rddid = r['rddid']
                yield Request("http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=" + rddid,
                              callback=self.download_pdf, meta=r)

    def download_pdf(self, response):
        filename = "{paper}/{anio}/".format(**response.meta) + \
                   "{titulo}-{fecPublica}.pdf".format(**response.meta).replace("/", "_")
        if not os.path.exists(os.path.dirname(filename)):
            try:
                os.makedirs(os.path.dirname(filename))
            except OSError as exc:  # guard against race condition
                if exc.errno != errno.EEXIST:
                    raise

        with open(filename, 'wb') as f:
            f.write(response.body)

It ran fine (if slowly), but the script has two persistent problems.

First, since the update it throws the following error:

2019-01-07 11:53:34 [scrapy.core.scraper] ERROR: Spider error processing <POST http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php> (referer: http://digesto.asamblea.gob.ni/consultas/coleccion/)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "gaceta_downloader.py", line 58, in parse_rdds
    if not r['fecPublica']:
KeyError: 'fecPublica'
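
Since the traceback points at the `r['fecPublica']` lookup, my guess is that the site now returns some records without that key, so the bare subscript raises `KeyError` instead of hitting the falsy check. A minimal sketch of the defensive lookup I have in mind, using `dict.get()` (the sample JSON below is made up, not actual `proxy.php` output):

```python
import json
from datetime import datetime

date_format = '%d/%m/%Y'

# Hypothetical sample mimicking the proxy.php response:
# the second record lacks 'fecPublica' entirely.
data = json.loads("""
{"rdds": [
    {"rddid": "1", "fecPublica": "05/03/1850"},
    {"rddid": "2"}
]}
""")

kept = []
for r in data["rdds"]:
    fec = r.get("fecPublica")  # returns None instead of raising KeyError
    if not fec:
        continue
    kept.append((r["rddid"], datetime.strptime(fec, date_format)))

print(kept)
```

This keeps only records that actually carry a publication date and silently skips the rest, which matches what the original `if not r['fecPublica']: continue` seemed to intend.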

Second, back when the script still ran (as it did until I updated Python and the packages a few days ago), it would sometimes complain UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 27: ordinal not in range(128), and I suspect it sometimes left zero-byte files behind. Do you see an encoding error in the source code? Could it be related to the problem above?
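
For what it's worth, `'\xb0'` is the degree sign, so my suspicion is that a non-ASCII character in a title (e.g. "Nº"/"°") ends up somewhere that forces an ASCII encode, most likely in the file name built from `{titulo}`. A hedged workaround would be to fold the title to plain ASCII before building the path; the helper below is my own sketch, not part of the original script:

```python
import unicodedata

def to_ascii(text):
    # NFKD splits compatibility characters (e.g. 'º' -> 'o'); encoding with
    # errors='ignore' then drops anything still non-ASCII, such as '°'.
    # Assumption: losing those characters in file names is acceptable.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(to_ascii("Gaceta N\xba 12\xb0"))
```

If that is the cause, applying `to_ascii()` to the `titulo` field before the `.format()` call in `download_pdf` should stop the `UnicodeEncodeError`, though it would not explain the zero-byte files by itself.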

0 Answers