I want to extract text from a URL. Requesting the URL automatically downloads a PDF.
import io
import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileReader

def extract_info_from_pdf_url(url):
    r = requests.get(url)
    f = io.BytesIO(r.content)
    reader = PdfFileReader(f)
    No_of_pages = reader.getNumPages()
    for i in range(No_of_pages):
        contents = reader.getPage(i).extractText().split('\n')
        print(contents)

url = "http://www.oagkenya.go.ke/index.php/reports/doc_download/887-nock-ltd"
extract_info_from_pdf_url(url)
Answer 0 (score: 1)
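The PDF you are trying to extract text from is really just a stack of scanned images. PdfFileReader and other PDF readers extract text from the text layer embedded in the document, so you will not get any results here. If the text is not already embedded in the PDF, you need OCR to extract it.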
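A quick way to confirm this diagnosis is to check how much text the reader actually returns. The sketch below is only an illustration: `looks_scanned` and its `min_chars` threshold are hypothetical names and values, not part of PyPDF2, and the threshold of 20 characters is an arbitrary assumption.

```python
def looks_scanned(page_texts, min_chars=20):
    """Guess whether a PDF has no usable text layer.

    page_texts: one extracted string per page (empty or None when the
    reader found nothing). min_chars is an assumed, arbitrary cutoff.
    """
    # Count only non-whitespace characters across all pages.
    total = sum(len("".join(text.split())) for text in page_texts if text)
    return total < min_chars


# Pages that yield only whitespace suggest scanned images.
print(looks_scanned(["", " \n "]))
# A page with real text suggests an embedded text layer.
print(looks_scanned(["Report of the Auditor-General on the financial statements"]))
```

With the code from the question, the input could be built as `[reader.getPage(i).extractText() for i in range(reader.getNumPages())]`. If the function returns True, the OCR route is probably the one to take.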
You can use Tesseract for this. Tesseract does not accept PDF input, so first convert the .pdf to a .tiff with a tool such as ImageMagick's convert:
convert -density 300 /path/to/my/document.pdf -depth 8 -strip -background white -alpha off file.tiff
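Then run tesseract on that file. The second argument is the output base name; Tesseract appends .txt itself, so this writes output.txt:
tesseract file.tiff output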
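Since the question is in Python, the two shell steps can also be driven from a script. This is a minimal sketch, assuming ImageMagick's `convert` and the `tesseract` binary are installed and on the PATH; the function names `build_ocr_commands` and `ocr_pdf` and the default file names are hypothetical.

```python
import subprocess


def build_ocr_commands(pdf_path, tiff_path="file.tiff", output_base="output"):
    """Return the convert and tesseract argument lists for one PDF."""
    convert_cmd = [
        "convert", "-density", "300", pdf_path,
        "-depth", "8", "-strip", "-background", "white",
        "-alpha", "off", tiff_path,
    ]
    # Tesseract appends ".txt" to the output base name on its own.
    tesseract_cmd = ["tesseract", tiff_path, output_base]
    return convert_cmd, tesseract_cmd


def ocr_pdf(pdf_path, tiff_path="file.tiff", output_base="output"):
    """Run both steps and return the recognized text."""
    for cmd in build_ocr_commands(pdf_path, tiff_path, output_base):
        # check=True raises CalledProcessError if a tool exits with an error.
        subprocess.run(cmd, check=True)
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

To use it on the PDF from the question, one would first save `r.content` to a local file and then call `ocr_pdf` on that path. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.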
Answer 1 (score: 0)
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFreader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http:/....view.php?fil_Id=5515");
        byte[] response = null;

        // Download the whole response body into memory.
        try (InputStream in = new BufferedInputStream(url.openStream());
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[1024];
            int n = 0;
            while (-1 != (n = in.read(buf))) {
                out.write(buf, 0, n);
            }
            response = out.toByteArray();
        }

        // Save the bytes as a local PDF file.
        try (OutputStream os = new FileOutputStream("abc.pdf")) {
            os.write(response);
        }

        // Load the saved file with PDFBox and print its embedded text.
        File file = new File("abc.pdf");
        PDDocument document = PDDocument.load(file);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(document);
        System.out.println(text);
        document.close();
    }
}