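Guys, I previously posted a question about the pyPdf Python tool. Please don't mark this one as a duplicate, because I am now getting the error shown below.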
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()

# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
For the first PDF file I get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
And for this PDF, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf, I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)
How can I fix this?
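Answer 0 (score: 2)
I tried it myself and got the same result. Ignore my comment; I hadn't noticed that you were writing the output to a file. That is where the problem is: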
f.write(convertPdf2String(sys.argv[1]))
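convertPdf2String returns a Unicode string, but file.write can only write bytes, so the call to f.write tries to convert the Unicode string automatically using the ASCII encoding. Because the PDF evidently contains non-ASCII characters, that conversion fails. So it should be something like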
f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
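To see the difference between the two encodings without needing pyPdf or a PDF at hand, here is a small sketch. The `sample` string is made up for illustration (a few Devanagari characters plus `\xe7`); it only stands in for whatever extractText() returns. The explicit `.encode("ascii")` call is my way of reproducing the conversion that the failing write attempts, not the exact code path inside file.write.

```python
# Hypothetical stand-in for the Unicode text extracted from the PDF.
sample = u"\u0939\u093f\u0902\u0926\u0940 \xe7"

# Strict ASCII encoding raises the same kind of error as in the question.
try:
    sample.encode("ascii")
    strict_ascii_failed = False
except UnicodeEncodeError:
    strict_ascii_failed = True

# Option 1: UTF-8 keeps every character and can be decoded back losslessly.
utf8_bytes = sample.encode("utf-8")

# Option 2: stay within ASCII, replacing each non-ASCII character
# with an XML numeric character reference such as &#231;.
ref_bytes = sample.encode("ascii", "xmlcharrefreplace")
```

Which one to pick depends on who reads a.txt afterwards: UTF-8 if the file should contain the actual Hindi text, xmlcharrefreplace if the consumer can only handle ASCII and understands character references.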
Edit:
Working source code, with only one line changed.
# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()

# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
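As an alternative to calling .encode() by hand at every write, one could open the output file with an explicit encoding so the file object does the conversion. This is only a sketch of that idea, not part of the code above: `write_text` is a hypothetical helper name, `sample` is made-up text standing in for the result of convertPdf2String, and the temporary directory is there just so the example is self-contained.

```python
import io
import os
import tempfile

def write_text(path, text):
    # io.open with an explicit encoding accepts Unicode text
    # and encodes it to UTF-8 bytes when writing.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

# Hypothetical stand-in for the Unicode string returned by convertPdf2String.
sample = u"\u0939\u093f\u0902\u0926\u0940 \xe7"

out_path = os.path.join(tempfile.mkdtemp(), "a.txt")
write_text(out_path, sample)

# Read the file back with the same encoding to check nothing was lost.
with io.open(out_path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()
```

The advantage is that the encoding is stated once, where the file is opened, so a later f.write of a Unicode string cannot fall back to ASCII by accident.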