I am trying to download PDFs from a website. I followed the instructions on the Scrapy website, but I am getting this error:
File "/home/joseph/ENV/lib/python3.5/site-packages/scrapy/http/request/__init__.py", line 58, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2017-09-12 17:47:40 [scrapy.core.scraper] ERROR: Error processing {'file_urls': 'https://www.sec.gov/divisions/corpfin/cf-noaction/2008/jpmorgan080409-405.pdf',
'title': ('JPMorgan Chase & Co.',)}
Settings.py
ITEM_PIPELINES = {
    'sec_scrape.pipelines.SecScrapePipeline': 300,
    'sec_scrape.pipelines.JsonWriterPipeline': 800,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/home/joseph/pdf'
Items.py
import scrapy
class LetterItem(scrapy.Item):
    title = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()
spider.py
import scrapy
from sec_scrape.items import LetterItem
class QuotesSpider(scrapy.Spider):
    name = "corporate_finance"
    allowed_domains = ["sec.gov"]
    start_urls = ['https://www.sec.gov/divisions/corpfin/cf-noaction.shtml']

    def parse(self, response):
        for letter in response.xpath('//table[2]/tr/td[3]/ul[74]/li/a'):
            item = LetterItem()
            item['title'] = letter.xpath('text()').extract_first(),
            item['file_urls'] = response.urljoin(letter.xpath('@href').extract_first())
            yield item
Does anyone know why I am getting this error?
Thanks.
Answer 0 (score: 1)
The file_urls item attribute must be a list, but you are setting it to a string (the URL of the file to download). Change the line
item['file_urls'] = response.urljoin(letter.xpath('@href').extract_first())
to
item['file_urls'] = [response.urljoin(letter.xpath('@href').extract_first())]
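This also explains the odd "h" in the error message. The sketch below is a plain-Python illustration, not Scrapy's actual code: it assumes the pipeline loops over file_urls and builds one request per element. The helper names first_request_url and has_scheme are hypothetical and exist only for this example.

```python
from urllib.parse import urlparse

url = "https://www.sec.gov/divisions/corpfin/cf-noaction/2008/jpmorgan080409-405.pdf"


def first_request_url(file_urls):
    # Assumption: the pipeline iterates over file_urls and requests each element.
    # Iterating over a string yields its characters one at a time.
    return next(iter(file_urls))


def has_scheme(candidate):
    # True when the candidate URL starts with a scheme such as "https".
    return bool(urlparse(candidate).scheme)


as_string = first_request_url(url)    # file_urls set to a bare string
as_list = first_request_url([url])    # file_urls set to a one-element list

print(repr(as_string), has_scheme(as_string))
print(repr(as_list), has_scheme(as_list))
```

With the bare string, the first "URL" is just the character h, which has no scheme, and that matches "Missing scheme in request url: h". With the list, the first element is the full URL.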