Custom filename from scraped elements

Time: 2019-10-30 05:08:01

Tags: python scrapy scrapy-pipeline

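I have seen "Scrapy file download how to use custom filename". My question is how to extend that code further by adding values scraped from the page to the filename. I created an item field that holds the filename, but how do I access that item, or pass it along, in the file_urls pipeline? I tried using the response to re-scrape the elements, but my approach is going wrong somewhere.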

def doc_page(self, response):
    # .attrib is an empty dict when nothing matches, so use .get() rather than ['href']
    href = response.xpath('//tr/td/a').attrib.get('href')
    if href is None:
        return
    next_page = response.urljoin(href)
    # desired preceding filename
    form_type = response.xpath('//tr/td[2]//text()').get(default='')
    info_text = (response.xpath('.//div[contains(@class, "formContent")]')
                 .xpath('.//div[contains(@class, "info")][2]//text()')
                 .get(default=''))
    filename = 'MSFT_' + form_type + '_' + info_text
    # ItemLoader expects a response or a Selector here, not a URL string
    loader = ItemLoader(item=SecScrapeItem(), response=response)
    loader.add_value('file_urls', next_page)
    if filename:
        loader.add_value('myFile', filename)
    yield loader.load_item()
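
Since `.get()` returns `None` when an XPath matches nothing, joining the scraped pieces with `+` can fail with a `TypeError`. A small helper along these lines might make the prefix more robust. The name `build_prefix` is made up for illustration and is not part of Scrapy; it only strips whitespace and skips missing parts:

```python
def build_prefix(*parts):
    """Join scraped text fragments with '_', skipping None or blank ones."""
    cleaned = [p.strip() for p in parts if p is not None and p.strip()]
    return "_".join(cleaned)


# Example with the values from the question (hypothetical inputs):
prefix = build_prefix("MSFT", " 10-K ", "2018-08-03\n")
print(prefix)  # MSFT_10-K_2018-08-03
```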

def file_path(self, request, response=None, info=None):
    original_path = super(SecScrapePipeline, self).file_path(request, response=None, info=None)
    sha1_and_extension = original_path.split('/')[1]  # delete 'full/' from the path
    # `item` is not defined inside this method; getting the scraped value here is the open question
    return request.meta.get('filename', '') + item['myFile'] + "_" + sha1_and_extension

I expect the output to be 'MSFT_10-K_2018-08-03' + SHA1_extension, but I only get SHA1_extension.
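
One possible way to get a name of that shape, sketched under assumptions rather than tested against this project: Scrapy 2.4 and newer pass the item being processed to `file_path` as a keyword-only `item` argument, and `ItemLoader` stores values as lists unless an output processor such as `TakeFirst` is set. The names `prefixed_path` and `PrefixedFilePathMixin` are hypothetical, and `SecScrapeItem` is assumed to be a `scrapy.Item` (so it supports `.get`). The block imports nothing from Scrapy so the path logic can be tried on its own.

```python
def prefixed_path(original_path, prefix):
    """Return '<prefix>_<sha1>.<ext>', dropping the leading 'full/' directory."""
    sha1_and_extension = original_path.split("/")[-1]
    if isinstance(prefix, (list, tuple)):
        # ItemLoader collects values into lists by default; take the first one.
        prefix = prefix[0] if prefix else ""
    return f"{prefix}_{sha1_and_extension}" if prefix else sha1_and_extension


class PrefixedFilePathMixin:
    """Hypothetical mixin, meant to be listed before FilesPipeline in the bases."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # Let the parent class compute the default 'full/<sha1>.<ext>' path.
        original_path = super().file_path(
            request, response=response, info=info, item=item
        )
        # Assumption: the spider stored the desired prefix in the 'myFile' field.
        prefix = item.get("myFile") if item is not None else None
        return prefixed_path(original_path, prefix)
```

It would be wired up as `class SecScrapePipeline(PrefixedFilePathMixin, FilesPipeline): pass`. On Scrapy versions older than 2.4, where `file_path` does not receive the item, one alternative is to override `get_media_requests(self, item, info)` and copy the value into `request.meta`, which would fit the `request.meta.get('filename', '')` call in the pipeline above.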

Any ideas would be helpful!

0 answers:

No answers yet