My goal is to automatically download specific Excel files from a website.
I have a spider that finds the links to the Excel files and yields them into a pipeline.
The yield function in the spider is:
    def extracting_function(self, response):
        # do something
        yield scrapy.Request(url=dl_path, callback=self.download_excel)

    def download_excel(self, response):
        fn = response.url.split('/')[-1]
        self.logger.info('Saving EXL %s', fn)
        dl_item = DownloadItem(body=response.body, url=response.url)
        yield dl_item
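For context, one quick way to see what the request actually returned is to inspect the first bytes of `response.body`: an .xlsx file is a ZIP container and starts with the magic bytes `PK\x03\x04`, whereas an HTML page starts with text like `<!DOCTYPE` or `<html`. A minimal sketch of such a check (the helper name is my own):

```python
def looks_like_xlsx(body):
    """Return True if the raw bytes look like an xlsx (ZIP) file.

    .xlsx files are ZIP containers, so they begin with the ZIP
    local-file-header magic bytes b'PK\x03\x04'. An HTML page
    instead starts with text such as b'<!DOCTYPE' or b'<html'.
    """
    return body[:4] == b'PK\x03\x04'
```

Logging `looks_like_xlsx(response.body)` inside `download_excel` would show whether the server is really serving the file at `dl_path`, or an HTML landing page instead.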
The pipeline processes it as:
    def process_item(self, item, spider):
        dl_url = item['url']
        fn = dl_url.split('/')[-1]
        self.replace_and_save(item['body'], fn)
        return item

    def replace_and_save(self, to_save, fn):
        o_fn = os.path.join(self.save_path, fn)
        ## replace
        if os.path.exists(o_fn):
            old_file = self.get_one_excel_df(o_fn)
            new_name = '{0}_{1}'.format(fn, datetime.datetime.today())
            with open(new_name, 'wb') as ouf:
                # pickle.dump takes the object first, then the file handle
                pickle.dump(old_file, ouf)
            print('Removing old file {0} after creating a backup version of it'.format(o_fn))
            os.remove(o_fn)
        with open(o_fn, 'wb') as ouf:
            ouf.write(to_save)
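One side note on the backup step: `new_name` embeds `datetime.datetime.today()` directly, whose default string form contains spaces and colons, which are not valid in filenames on some platforms (e.g. Windows). A sketch of a safer timestamped name (the format string is my own choice):

```python
import datetime

def backup_name(fn):
    # strftime avoids the spaces and colons that str(datetime) produces,
    # which some filesystems reject in filenames
    stamp = datetime.datetime.today().strftime('%Y%m%d_%H%M%S')
    return '{0}_{1}'.format(fn, stamp)
```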
A file is being saved, but it is not the xlsx file I want; instead it is the body of the page from which I scraped the download link.
What am I doing wrong here?
Thanks