I have downloaded some files using the Files Pipeline, and I want to get the value of the files field. I tried to print item['files'], but it gives me a KeyError. Why does this happen, and what can I do about it?
import re
from time import strftime
from urlparse import urljoin

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

from genericspider.items import GenericspiderItem  # adjust to your project's items module


class testspider2(CrawlSpider):
    name = 'genspider'
    URL = 'flu-card.com'
    URLhttp = 'http://www.flu-card.com'
    allowed_domains = [URL]
    start_urls = [URLhttp]
    rules = [
        Rule(LxmlLinkExtractor(allow=(), restrict_xpaths=('//a'), unique=True),
             callback='parse_page', follow=True),
    ]

    def parse_page(self, response):
        List = response.xpath('//a/@href').extract()
        item = GenericspiderItem()
        date = strftime("%Y-%m-%d %H:%M:%S")  # date & time as yyyy-mm-dd hh:mm:ss
        MD5hash = ''  # stored as part of the item; some crawled links are not file links, so these fields stay empty
        fileSize = ''
        newFilePath = ''
        File = open('c:/users/kevin123/desktop//ext.txt', 'a')
        for links in List:
            if re.search('http://www.flu-card.com', links) is None:
                responseurl = re.sub('\/$', '', response.url)
                url = urljoin(responseurl, links)
            else:
                url = links
            #File.write(url+'\n')
            filename = url.split('/')[-1]
            fileExt = ''.join(re.findall('.{3}$', filename))
            if fileExt != '':
                blackList = ['tml', 'pdf', 'com', 'php', 'aspx', 'xml', 'doc']
                for word in blackList:  # note: this outer loop is redundant — the any() check already scans the whole blacklist
                    if any(x in fileExt for x in blackList):
                        pass  # url is blacklisted
                    else:
                        item['filename'] = filename
                        item['URL'] = url
                        item['date'] = date
                        print item['files']  # raises KeyError: 'files' — this field is never set in the spider
                        File.write(fileExt + '\n')
                        yield GenericspiderItem(
                            file_urls=[url]
                        )
                        yield item
Answer (score: 0):
You cannot access item['files'] inside the spider. That is because the FilesPipeline downloads the files, and an item only reaches the pipelines after it has left the spider.
You yield the item first, then it reaches the FilesPipeline, then the files are downloaded, and only then is the files field populated with the information you want. To access it, you have to write a pipeline of your own and schedule it after the FilesPipeline. In that pipeline you can access the files field.
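A minimal sketch of what that could look like, assuming the default FilesPipeline behaviour; the module path genericspider.pipelines, the class name FilesFieldPipeline, and the FILES_STORE path are placeholders for your own project:

# settings.py — lower numbers run first, so FilesPipeline fills
# item['files'] before our pipeline ever sees the item
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'genericspider.pipelines.FilesFieldPipeline': 2,
}
FILES_STORE = '/path/to/downloads'

# pipelines.py
class FilesFieldPipeline(object):
    def process_item(self, item, spider):
        # By this point FilesPipeline has populated item['files'] with
        # dicts like {'url': ..., 'path': ..., 'checksum': ...}
        for entry in item.get('files', []):
            spider.logger.info('downloaded %s -> %s', entry['url'], entry['path'])
        return item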
Also note that in your spider you are yielding two different kinds of items: one GenericspiderItem that only carries file_urls, and another that only carries filename, URL, and date. Each yield produces a separate item, so the downloaded-file info never ends up on the same item as your metadata.
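A minimal sketch of the single-item fix, assuming you add the FilesPipeline fields to GenericspiderItem alongside your own:

import scrapy

class GenericspiderItem(scrapy.Item):
    # fields the spider fills itself
    filename = scrapy.Field()
    URL = scrapy.Field()
    date = scrapy.Field()
    # fields used by FilesPipeline: file_urls is the input,
    # files is filled in by the pipeline after download
    file_urls = scrapy.Field()
    files = scrapy.Field()

# then in parse_page, yield one item carrying both the metadata and the file URL:
#     yield GenericspiderItem(filename=filename, URL=url, date=date, file_urls=[url])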