I'm working on a Scrapy spider and trying to use slate (https://pypi.python.org/pypi/slate) to extract text from multiple PDFs in a directory. So far I have:
    class Ove_Spider(BaseSpider):
        name = "ove"
        allowed_domains = ['myurl.com']
        start_urls = ['myurl/hgh/']

        def parse(self, response):
            for a in response.xpath('//a[@href]/@href'):
                link = a.extract()
                if link.endswith('.pdf'):
                    link = urlparse.urljoin(base_url, link)
                    yield Request(link, callback=self.save_pdf)

        def save_pdf(self, response):
            path = response.url.split('/')[-1]
            # get_pdf_content(base_url+path)
            with open(base_url+path, 'rb') as f:
                doc = slate.PDF(f)
                print(doc[0])
This gives me:
    File ....spiders\ove_spider.py", line 38, in save_pdf
        with open(base_url+path, 'rb') as f:
    IOError: [Errno 22] invalid mode ('rb') or filename:
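I have checked the path to the PDF and it is correct (base_url + path gives the absolute path on the remote server), so for some reason this does not open the remote file for reading. How can I make this work?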
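
For what it's worth, the built-in open() only understands local filesystem paths, so handing it a URL would be expected to fail like this. By the time save_pdf runs, Scrapy has already downloaded the file and the raw bytes are available as response.body, so one possible direction is to wrap those bytes in an in-memory file object instead of reopening anything. The following is only a rough sketch: read_pdf_from_response and its parse_pdf parameter are hypothetical names introduced here, and it assumes that slate.PDF accepts any seekable binary file-like object rather than only a real file on disk.

```python
import io


def read_pdf_from_response(response, parse_pdf):
    """Parse a PDF from bytes that were already downloaded.

    `response` is anything with a `.body` attribute holding the raw
    bytes (as a Scrapy response does). `parse_pdf` is a callable that
    takes a binary file-like object; the assumption is that slate.PDF
    can be passed here.
    """
    # io.BytesIO gives a seekable, file-like view over the bytes,
    # so nothing has to be opened from disk or from a URL.
    with io.BytesIO(response.body) as f:
        return parse_pdf(f)
```

Inside the spider, save_pdf could then be reduced to something like doc = read_pdf_from_response(response, slate.PDF) followed by print(doc[0]), again assuming slate is happy with an in-memory file.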