Question

I am working on a Scrapy spider and trying to use slate (https://pypi.python.org/pypi/slate) to extract the text from multiple PDFs in a directory. So far I have:

class Ove_Spider(BaseSpider):

    name = "ove"


    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']


    def parse(self, response):
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        # get_pdf_content(base_url+path)
        with open(base_url+path, 'rb') as f:
            doc = slate.PDF(f)
            print(doc[0])

I get:

File ....spiders\ove_spider.py", line 38, in save_pdf
with open(base_url+path, 'rb') as f:
IOError: [Errno 22] invalid mode ('rb') or filename:

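I have checked the path to the PDF and it is correct (base_url + path gives the absolute path on the remote server), so for some reason open() will not open the remote file for reading. How can I make this work?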
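
One possible direction, offered as a sketch rather than a confirmed fix: the built-in open() reads local filesystem paths only, so passing it a URL fails. By the time save_pdf runs, Scrapy has already downloaded the file, and its raw bytes are available as response.body. Those bytes can be wrapped in an in-memory file object and handed to the parser. The helper name pdf_from_bytes and its parse_pdf parameter are hypothetical, and it is an assumption that slate.PDF accepts any binary file-like object rather than only one returned by open().

```python
import io


def pdf_from_bytes(body, parse_pdf):
    """Wrap already-downloaded PDF bytes in a file-like object and parse them.

    body      -- raw bytes of the PDF (in a Scrapy callback: response.body)
    parse_pdf -- callable taking a binary file object; slate.PDF is the
                 intended candidate (assumption: it accepts io.BytesIO)
    """
    return parse_pdf(io.BytesIO(body))


# Hypothetical use inside the spider, replacing the open() call:
#
#     def save_pdf(self, response):
#         doc = pdf_from_bytes(response.body, slate.PDF)
#         print(doc[0])
```

Passing the parser in as an argument keeps the helper independent of slate, so it can be tried with a stand-in parser before wiring it into the spider.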

Can slate be used to extract text from a PDF?

0 answers: