My goal is to automatically download specific Excel files from a website.
I have a spider that finds the links to the Excel files and yields them into a pipeline.
The yield function in the spider is:
    def extracting_function(self, response):
        # do something
        yield scrapy.Request(url=dl_path, callback=self.download_excel)

    def download_excel(self, response):
        fn = response.url.split('/')[-1]
        self.logger.info('Saving EXL %s', fn)
        dl_item = DownloadItem(body=response.body, url=response.url)
        yield dl_item
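For context, one quick way to see what the request actually returned is to inspect the first bytes of `response.body`: an .xlsx file is a ZIP container and starts with the magic bytes `PK\x03\x04`, whereas an HTML page starts with text like `<!DOCTYPE` or `<html`. A minimal sketch of such a check (the helper name is my own):

```python
def looks_like_xlsx(body):
    """Return True if the raw bytes look like an xlsx (ZIP) file.

    .xlsx files are ZIP containers, so they begin with the ZIP
    local-file-header magic bytes b'PK\x03\x04'. An HTML page
    instead starts with text such as b'<!DOCTYPE' or b'<html'.
    """
    return body[:4] == b'PK\x03\x04'
```

Logging `looks_like_xlsx(response.body)` inside `download_excel` would show whether the server is really serving the file at `dl_path`, or an HTML landing page instead.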
The pipeline processes it as:
    def process_item(self, item, spider):
        dl_url = item['url']
        fn = dl_url.split('/')[-1]
        self.replace_and_save(item['body'], fn)
        return item

    def replace_and_save(self, to_save, fn):
        o_fn = os.path.join(self.save_path, fn)
        ## replace
        if os.path.exists(o_fn):
            old_file = self.get_one_excel_df(o_fn)
            new_name = '{0}_{1}'.format(fn, datetime.datetime.today())
            with open(new_name, 'wb') as ouf:
                # pickle.dump takes the object first, then the file handle
                pickle.dump(old_file, ouf)
            print('Removing old file {0} after creating a backup version of it'.format(o_fn))
            os.remove(o_fn)
        with open(o_fn, 'wb') as ouf:
            ouf.write(to_save)
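One side note on the backup step: `new_name` embeds `datetime.datetime.today()` directly, whose default string form contains spaces and colons, which are not valid in filenames on some platforms (e.g. Windows). A sketch of a safer timestamped name (the format string is my own choice):

```python
import datetime

def backup_name(fn):
    # strftime avoids the spaces and colons that str(datetime) produces,
    # which some filesystems reject in filenames
    stamp = datetime.datetime.today().strftime('%Y%m%d_%H%M%S')
    return '{0}_{1}'.format(fn, stamp)
```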
A file is being saved, but it is not the xlsx file I want; instead it is the body of the page from which I scraped the download link.
What am I doing wrong here?
Thanks