Can you use slate to extract text from a PDF?

Time: 2016-09-30 03:41:19

Tags: python pdf scrapy

I am working on a Scrapy spider and trying to use slate (https://pypi.python.org/pypi/slate) to extract the text of multiple PDFs in a directory. So far I have:

class Ove_Spider(BaseSpider):

    name = "ove"


    allowed_domains = ['myurl.com']
    start_urls = ['myurl/hgh/']


    def parse(self, response):
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        # get_pdf_content(base_url+path)
        with open(base_url+path, 'rb') as f:
            doc = slate.PDF(f)
            print(doc[0])

I get:

File ....spiders\ove_spider.py", line 38, in save_pdf
with open(base_url+path, 'rb') as f:
IOError: [Errno 22] invalid mode ('rb') or filename: 

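I have checked the path to the PDF and it is correct (base_url + path gives the absolute path on the remote server), so for some reason this does not open the remote file for reading. How can I make this work?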
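
One likely cause is that the built-in open() only accepts local filesystem paths, so passing it a URL raises the IOError shown above. By the time save_pdf runs, Scrapy has already downloaded the file and its bytes are available as response.body. The following is a minimal sketch of parsing those bytes in memory. The helper name extract_pdf_pages and its parse_pdf parameter are hypothetical and not part of slate or Scrapy, and the sketch assumes that slate.PDF accepts any binary file-like object, not only a file opened from disk.

```python
import io


def extract_pdf_pages(body, parse_pdf=None):
    """Return the pages parsed from PDF bytes that are already in memory.

    body is the raw PDF content, e.g. response.body in a Scrapy callback.
    parse_pdf is a hypothetical hook: any callable that takes a binary
    file-like object and returns an iterable of page texts.
    """
    if parse_pdf is None:
        # Assumption: slate is installed and slate.PDF works with a
        # file-like object such as io.BytesIO.
        import slate
        parse_pdf = slate.PDF
    # Wrap the bytes in an in-memory file instead of calling open() on a URL.
    return list(parse_pdf(io.BytesIO(body)))


# Inside the spider, the callback could then look like this:
#
#     def save_pdf(self, response):
#         pages = extract_pdf_pages(response.body)
#         print(pages[0])
```

If the PDFs also need to be kept on disk, the same bytes can be written to a local file opened with mode 'wb' before parsing. The file name should be a local path such as response.url.split('/')[-1], not base_url + path.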

0 answers:

No answers