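I wrote a function that converts every PDF in a directory to text, and I want to save the converted text of each PDF to a .txt file. My code raises "TypeError: expected str, bytes or os.PathLike object, not tuple". Can anyone help me with this? The code is attached here: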
import io
import os
import os.path
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.BytesIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)
        text = fake_file_handle.getvalue()
    # close open handles
    converter.close()
    fake_file_handle.close()
    if text:
        return text


def save_to_txt(lst):
    for i, ele in enumerate(lst):
        txtfile = "{}.txt".format(i)
        files = extract_text_from_pdf(ele)
        with open(txtfile, "w") as textfile:
            textfile.write(files)


if __name__ == '__main__':
    pdf_path = 'C:\\Users\\Lenovo\\.spyder-py3\\OCR'
    for root, _, files in os.walk(pdf_path):
        for filename in files:
            filepath = os.path.join(root, filename)
            extract_text_from_pdf(filepath)
    for f in filepath:
        save_to_txt(f)
The error is as follows:
runfile('C:/Users/Lenovo/.spyder-py3/updatedpy.py', wdir='C:/Users/Lenovo/.spyder-py3')
Traceback (most recent call last):
  File "<ipython-input-17-f6b3bb00c382>", line 1, in <module>
    runfile('C:/Users/Lenovo/.spyder-py3/updatedpy.py', wdir='C:/Users/Lenovo/.spyder-py3')
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Lenovo/.spyder-py3/updatedpy.py", line 47, in <module>
    extract_text_from_pdf(file)
  File "C:/Users/Lenovo/.spyder-py3/updatedpy.py", line 22, in extract_text_from_pdf
    with open(pdf_path, 'rb') as fh:
TypeError: expected str, bytes or os.PathLike object, not tuple
Answer 0 (score: 2)
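The error is caused by the way os.walk is used in your main section. os.walk does not return file names; for each directory it visits, it yields a tuple (dirpath, dirnames, filenames), so passing that value straight to open() fails. See the os documentation for more details.
Edit: you can use os.walk as follows: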
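To make the tuple visible, here is a minimal sketch that builds a throwaway directory and inspects what os.walk yields. The helper name list_files and the file names a.pdf and sub/b.pdf are made up for this illustration and are not part of the question's code.

```python
import os
import tempfile


def list_files(top):
    """Return the full path of every file found under `top`."""
    paths = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory.
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths


# Hypothetical layout, created only for this demonstration.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.pdf", os.path.join("sub", "b.pdf")):
        open(os.path.join(top, rel), "w").close()

    entry = next(os.walk(top))
    print(type(entry).__name__, len(entry))  # prints: tuple 3
    print(sorted(os.path.relpath(p, top) for p in list_files(top)))
```

Passing `entry` itself to open() would reproduce the TypeError from the question; only the joined strings built in list_files are valid paths.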
for root, _, files in os.walk(pdf_path):
    for filename in files:
        filepath = os.path.join(root, filename)
        extract_text_from_pdf(filepath)
Alternatively, you can use the path.py library and its walkfiles method. That would look like this:
from path import Path

pdf_path = Path('C:\\dev')
for file in pdf_path.walkfiles():
    extract_text_from_pdf(file)
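Neither loop above writes the extracted text anywhere, which was the original goal. The following is a rough sketch of one way to do that, not a definitive implementation. The function name convert_tree and the out_dir parameter are hypothetical, and the extractor is passed in as an argument so the sketch does not depend on pdfminer itself. It assumes the extractor may return bytes (as the question's function does with io.BytesIO), str, or None for an empty document, and that any bytes are UTF-8 encoded.

```python
import os


def convert_tree(top, extract_text, out_dir):
    """Write one .txt file per PDF under `top`; return the paths written.

    `extract_text` is any callable that takes a file path and returns the
    document text (for example the question's extract_text_from_pdf).
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for root, _dirs, files in os.walk(top):
        for filename in files:
            # Skip anything that is not a PDF.
            if not filename.lower().endswith(".pdf"):
                continue
            text = extract_text(os.path.join(root, filename))
            # Assumption: bytes output is UTF-8 encoded.
            if isinstance(text, bytes):
                text = text.decode("utf-8", errors="replace")
            txt_name = os.path.splitext(filename)[0] + ".txt"
            txt_path = os.path.join(out_dir, txt_name)
            with open(txt_path, "w", encoding="utf-8") as out:
                out.write(text or "")  # None becomes an empty file
            written.append(txt_path)
    return written
```

With the question's code this would be called as convert_tree(pdf_path, extract_text_from_pdf, some_output_directory). Note that two PDFs with the same file name in different subdirectories would overwrite each other's output in this sketch.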