How to access the value of the 'files' field in Scrapy

Time: 2016-03-31 00:34:42

Tags: python-2.7 scrapy-spider

I have downloaded some files using the files pipeline, and I want to get the value of the files field. I tried printing item['files'], but it gives me a KeyError. Why does that happen, and how can I access the field?

import re
from time import strftime
from urlparse import urljoin  # Python 2.7

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

from genericspider.items import GenericspiderItem  # adjust to your project's items module


class testspider2(CrawlSpider):
    name = 'genspider'
    URL = 'flu-card.com'
    URLhttp = 'http://www.flu-card.com'
    allowed_domains = [URL]
    start_urls = [URLhttp]
    rules = (
        Rule(LxmlLinkExtractor(allow=(), restrict_xpaths=('//a',), unique=True),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        List = response.xpath('//a/@href').extract()
        date = strftime("%Y-%m-%d %H:%M:%S")  # date & time, yyyy-mm-dd hh:mm:ss
        MD5hash = ''   # stored as part of the item; some crawled links are not
        fileSize = ''  # file links, so they have no values for these fields
        newFilePath = ''
        File = open('c:/users/kevin123/desktop/ext.txt', 'a')
        for links in List:
            # resolve relative links against the response URL
            if re.search('http://www.flu-card.com', links) is None:
                responseurl = re.sub('/$', '', response.url)
                url = urljoin(responseurl, links)
            else:
                url = links
            #File.write(url + '\n')
            filename = url.split('/')[-1]
            fileExt = ''.join(re.findall('.{3}$', filename))
            if fileExt != '':
                blackList = ['tml', 'pdf', 'com', 'php', 'aspx', 'xml', 'doc']
                if any(x in fileExt for x in blackList):
                    continue  # url is blacklisted
                item = GenericspiderItem()  # fresh item for each link
                item['filename'] = filename
                item['URL'] = url
                item['date'] = date
                print item['files']  # raises KeyError -- 'files' is not set here
                File.write(fileExt + '\n')
                yield GenericspiderItem(
                    file_urls=[url]
                )
                yield item
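
For reference, the spider above presumes an item class along these lines in items.py (a sketch; the field names are taken from the spider code, and file_urls/files are the two fields FilesPipeline expects):

# items.py -- sketch of the fields the spider assumes
import scrapy

class GenericspiderItem(scrapy.Item):
    filename = scrapy.Field()
    URL = scrapy.Field()
    date = scrapy.Field()
    file_urls = scrapy.Field()  # input: URLs for FilesPipeline to download
    files = scrapy.Field()      # output: filled in by FilesPipeline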

1 Answer:

Answer 0 (score: 0)

You cannot access item['files'] from inside the spider. That is because the FilesPipeline is what downloads the files, and an item only reaches the pipelines after it has left the spider.

You first yield the item; it then reaches the FilesPipeline, the files get downloaded, and only then is the files field populated with the information you want. To access it, you have to write your own pipeline and schedule it after the FilesPipeline. Inside that pipeline you can access the files field.
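
For illustration, here is a minimal sketch of such a pipeline; the project package myproject and the class name FileInfoPipeline are placeholders, not part of the original question:

# pipelines.py -- runs after FilesPipeline, so item['files'] is populated
class FileInfoPipeline(object):
    def process_item(self, item, spider):
        # each entry is a dict like {'url': ..., 'path': ..., 'checksum': ...}
        for file_info in item.get('files', []):
            print file_info['path'], file_info['checksum']
        return item

The ordering is set in settings.py; the pipeline with the lower number runs first:

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'myproject.pipelines.FileInfoPipeline': 2,  # after FilesPipeline
}
FILES_STORE = '/path/to/file/store'  # where FilesPipeline saves downloads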

Also note that in your spider you are yielding two different kinds of items!
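
A sketch of how the two yields could be merged into one item, assuming GenericspiderItem declares all of these fields (see the items.py sketch in the question):

# in parse_page: carry the metadata and file_urls on a single item
item = GenericspiderItem()
item['filename'] = filename
item['URL'] = url
item['date'] = date
item['file_urls'] = [url]  # input field read by FilesPipeline
yield item  # item['files'] is filled in later, after FilesPipeline runs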