Python: PDF to txt

Asked: 2016-10-21 15:55:26

Tags: python-2.7

I want to convert a PDF file to txt. This is my code:

testFile = urllib.URLopener()
testFile.retrieve("http://url_to_download" , "/Users/gabor_dev/Desktop/pdf_tst/tst.pdf")

content = ""

pdf = pyPdf.PdfFileReader(file("/Users/gabor_dev/Desktop/pdf_tst/tst.pdf", "rb"))

for i in range(0, pdf.getNumPages()):
    f = open("/Users/gabor_dev/Desktop/pdf_tst/xxx.txt",'a')
    content= pdf.getPage(i).extractText() + "\n"
    c=content.split()
    for a in c:
        f.write(" ")
        f.write(a)
        f.write('\n')
        f.close()

The PDF downloads fine, but when I try to convert it to txt, only the first word of the PDF ends up in my txt file, and then I get this error:

Traceback (most recent call last):
  File "/Users/gabor_dev/PycharmProjects/text_class_tst/textClass.py", line 26, in <module>
    f.write(" ")
ValueError: I/O operation on closed file

What am I doing wrong? Thanks!

1 answer:

Answer 0: (score: 0)

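Your f.close() sits inside the inner for loop, so the file is closed right after the first word is written and the next f.write() fails. It is better to use with open, which opens the file once and closes it when the block ends: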

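import urllib
import pyPdf

# Download the sample PDF into the current directory (Python 2 urllib API).
testFile = urllib.URLopener()
testFile.retrieve("http://www.pdf995.com/samples/pdf.pdf", "./tst.pdf")

pdf = pyPdf.PdfFileReader(file("./tst.pdf", "rb"))

# Open the output file once; the with block closes it after all pages are written.
with open("./xxx.txt", 'a') as f:
    for i in range(0, pdf.getNumPages()):
        content = pdf.getPage(i).extractText() + "\n"
        c = content.split()
        # Write each whitespace-separated word on its own line.
        for a in c:
            f.write(" ")
            f.write(a)
            f.write('\n')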

Tested and working.
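
If it helps to see the file handling on its own, here is a minimal sketch without the PDF parts. It assumes the page text has already been extracted into a list of strings; write_words and the sample pages are made-up names for illustration and are not part of pyPdf. Unlike the code above, it opens the file in "w" mode and does not write the leading space, so adjust that if you need the original behaviour.

```python
import os
import tempfile


def write_words(page_texts, out_path):
    """Write every whitespace-separated word of every page on its own line."""
    # The file is opened once, outside both loops, and is closed automatically
    # when the with block ends, even if an exception is raised inside it.
    with open(out_path, "w") as f:
        for text in page_texts:
            for word in text.split():
                f.write(word + "\n")


# Hypothetical stand-in for the strings extractText() would return, one per page.
pages = ["first page text", "second page"]
out_path = os.path.join(tempfile.mkdtemp(), "xxx.txt")
write_words(pages, out_path)

# Read the result back to check that words from every page were written.
with open(out_path) as f:
    words = f.read().split()
```

The point is only where the file is opened and closed: once per run, not once per word, so a later write can never hit a closed file.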