Extracting text from a URL that serves a PDF

Time: 2018-07-06 08:59:22

Tags: python pdf web-scraping

I want to extract text from a URL. The URL automatically downloads a PDF file.

import io
import requests
from PyPDF2 import PdfFileReader

def extract_info_from_pdf_url(url):
    # Download the PDF into memory and wrap the bytes in a file-like object.
    r = requests.get(url)
    f = io.BytesIO(r.content)
    reader = PdfFileReader(f)
    no_of_pages = reader.getNumPages()

    # Print the text of each page as a list of lines.
    for i in range(no_of_pages):
        contents = reader.getPage(i).extractText().split('\n')
        print(contents)

url = "http://www.oagkenya.go.ke/index.php/reports/doc_download/887-nock-ltd"

extract_info_from_pdf_url(url)

2 Answers:

Answer 0: (score: 1)

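The PDF you are trying to extract text from is really just a collection of scanned images. PdfFileReader and other PDF readers can only extract text that is already embedded in the document, so you will not get any results here. If the text is not embedded in the PDF, you need to use OCR to extract it.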
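One way to tell the two cases apart is to look at how much text the reader actually returned. The following is a minimal sketch, not part of the original answer: the function name `needs_ocr` and the `min_chars` threshold are hypothetical, and `page_texts` is assumed to be a list with one extracted string per page (for example, collected from `extractText()` in the question's loop).

```python
def needs_ocr(page_texts, min_chars=20):
    """Return True when the extracted text is too short to be a real text layer.

    page_texts: one string per page, e.g. what extractText() returned.
    min_chars: hypothetical threshold; tune it for your own documents.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


# Scanned pages usually come back empty or as stray whitespace.
print(needs_ocr(["", "\n", "  "]))                          # True
print(needs_ocr(["Annual report of the Auditor-General"]))  # False
```

If this returns True for a document, the page content is most likely image-only and an OCR step is needed.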

You can use Tesseract for this. Tesseract does not accept PDF as an input format, so first convert the .pdf to a .tiff with something like ImageMagick's convert:

convert -density 300 /path/to/my/document.pdf -depth 8 -strip -background white -alpha off file.tiff

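Then run tesseract on that file. The second argument is the output base name; tesseract appends .txt itself, so this writes output.txt:

tesseract file.tiff output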
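Since the question is in Python, the two shell steps above can be driven from a script. This is a rough sketch under stated assumptions: ImageMagick's `convert` and `tesseract` are installed and on the PATH, and the helper names `build_ocr_commands` and `ocr_pdf` are made up for illustration. It reuses only the flags shown above, and it passes tesseract an output base name because tesseract appends `.txt` to it.

```python
import subprocess


def build_ocr_commands(pdf_path, tiff_path="file.tiff", output_base="output"):
    """Return the two argv lists: PDF -> TIFF, then TIFF -> text."""
    convert_cmd = [
        "convert", "-density", "300", pdf_path,
        "-depth", "8", "-strip", "-background", "white", "-alpha", "off",
        tiff_path,
    ]
    # tesseract appends ".txt" to the output base name itself.
    tesseract_cmd = ["tesseract", tiff_path, output_base]
    return convert_cmd, tesseract_cmd


def ocr_pdf(pdf_path, output_base="output"):
    """Run both steps and return the recognized text."""
    for cmd in build_ocr_commands(pdf_path, output_base=output_base):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling `ocr_pdf("/path/to/my/document.pdf")` would then return the contents of output.txt. Building the argument lists separately from running them keeps the commands easy to inspect before anything is executed.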

Answer 1: (score: 0)

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFreader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http:/....view.php?fil_Id=5515");
        byte[] response = null;

        // Read the whole response body into memory.
        try (InputStream in = new BufferedInputStream(url.openStream());
                ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[1024];
            int n;
            while (-1 != (n = in.read(buf))) {
                out.write(buf, 0, n);
            }
            response = out.toByteArray();
        }

        // Save the bytes to a local PDF file; the stream is closed even on error.
        try (OutputStream os = new FileOutputStream("abc.pdf")) {
            os.write(response);
        }

        // Load the saved file with PDFBox and print its embedded text.
        File file = new File("abc.pdf");
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String text = pdfStripper.getText(document);
            System.out.println(text);
        }
    }
}