I am trying to extract text from a PDF hosted at a remote URL.
The URL is http://loc.gov/aba/publications/FreeLCC/A-text.pdf
My code is as follows:
import urllib2
import PyPDF2
import io
URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf'
remote_file = urllib2.urlopen(URL).read()
memory_file = io.BytesIO(remote_file)
read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()
for i in range(0, number_of_pages):
    pageObj = read_pdf.getPage(i)
    page = pageObj.extractText()
    print (page)
I get an HTTP 403 error. What am I doing wrong?
Answer 0 (score: 2)
import urllib2
import PyPDF2
import io
URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf'
# Send an explicit User-Agent header; the server appears to answer 403 to urllib2's default one.
req = urllib2.Request(URL, headers={'User-Agent' : "Magic Browser"})
remote_file = urllib2.urlopen(req).read()
memory_file = io.BytesIO(remote_file)
read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()
for i in range(0, number_of_pages):
    pageObj = read_pdf.getPage(i)
    page = pageObj.extractText()
    print (page)
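
For anyone on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request from the standard library. This is only a rough port under that assumption, not part of the original answer; the helper names build_request and fetch_to_memory are made up for illustration, and the "Magic Browser" string is simply reused from the code above.

```python
import io
import urllib.request


def build_request(url, user_agent="Magic Browser"):
    """Build a request that carries an explicit User-Agent header.

    Hypothetical helper: the name and default value are illustrative only.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_to_memory(url):
    """Download the resource and wrap its bytes in an in-memory file object.

    Hypothetical helper: performs a network call when invoked.
    """
    with urllib.request.urlopen(build_request(url)) as response:
        return io.BytesIO(response.read())
```

The object returned by fetch_to_memory(URL) plays the same role as memory_file above, so it can be handed to the PDF reader in the same way.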