I am extracting text from a directory full of PDFs. For this task I am using Python's textract module:
In:
for filename in glob.glob(os.path.join(input_directory, '*.pdf')):
    parsed = process(filename, method='tesseract', language='spa')
Out:
---> 31 get_ipython().magic(u'time transform_files(input_d, out_d)')
/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
2156 magic_name, _, magic_arg_s = arg_s.partition(' ')
2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158 return self.run_line_magic(magic_name, magic_arg_s)
2159
2160 #-------------------------------------------------------------------------
/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2078 with self.builtin_trap:
-> 2079 result = fn(*args,**kwargs)
2080 return result
2081
<decorator-gen-59> in time(self, line, cell, local_ns)
/usr/local/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):
/usr/local/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1174 if mode=='eval':
1175 st = clock2()
-> 1176 out = eval(code, glob, local_ns)
1177 end = clock2()
1178 else:
<timed eval> in <module>()
<ipython-input-11-ddedab540f65> in transform_files(input_directory, output_directory)
12
13 filename = os.path.basename(filename)
---> 14 texts = parsed['content']
15 all_texts[filename] = texts
16
TypeError: string indices must be integers, not str
I don't understand why this happens, because the documentation states that filename must be a path, and it is in fact just a path. I also tested with a single file, like this:
path = '/pathTo/PDF_FILE.pdf/'
text_ocr = textract.process(path, method='tesseract', language = 'spa')
That worked fine. So my question is: why am I getting TypeError: string indices must be integers, not str, and how do I correctly apply process to filename?
Update
I also tried putting the content into a dictionary:
parsed = process(filename, method='tesseract', language='spa', encoding='utf8')
parsed = {"content": parsed}
filename = os.path.basename(filename)
Answer 0 (score: 0)
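You seem to be chasing a lot of red herrings, such as the type of the filename variable, that have nothing to do with your exception. At the bottom of the traceback, Python tells you exactly where the exception occurred: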
<ipython-input-11-ddedab540f65> in transform_files(input_directory, output_directory)
12
13 filename = os.path.basename(filename)
---> 14 texts = parsed['content']
15 all_texts[filename] = texts
16
TypeError: string indices must be integers, not str
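From the exception message we can infer that parsed is a string, not a dictionary with a 'content' key. Looking at the earlier lines of your code, the parsed variable comes from the call to process. The textract documentation you linked to gives me no reason to expect process to return anything other than a string. Here is the basic example they provide, right at the top of their page: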
import textract
text = textract.process('path/to/file.extension')
The variable name text certainly suggests that you get a string back!
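To see the error in isolation, here is a minimal sketch. The sample text is made up and only stands in for whatever process returns; no textract call is involved.

```python
# Hypothetical stand-in for the value returned by process(): plain text, not a dict.
parsed = "text extracted from a PDF"

try:
    parsed['content']      # indexing a string with a string key
    raised = False
except TypeError:
    raised = True          # same exception type as in the traceback

print(raised)
print(parsed[0])           # integer indices are what a string accepts
```

The exact wording of the TypeError message differs between Python versions, so the sketch checks only the exception type.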
So I think you just need to rewrite your loop:
for filename in glob.glob(os.path.join(input_directory, '*.pdf')):
    texts = process(filename, method='tesseract', language='spa')
    filename = os.path.basename(filename)
    all_texts[filename] = texts
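Put together as a function, the loop might look like the sketch below. This is not your actual transform_files: the signature is changed, and the extract parameter is a hypothetical hook I added so the example runs without textract or Tesseract installed.

```python
import glob
import os


def transform_files(input_directory, extract):
    """Map each PDF's base name to the text returned by `extract`.

    `extract` is a hypothetical callable taking a file path and returning
    the extracted text directly (a string, not a dict).
    """
    all_texts = {}
    pattern = os.path.join(input_directory, '*.pdf')
    for path in sorted(glob.glob(pattern)):
        # Store the returned text as-is; there is no 'content' key to look up.
        all_texts[os.path.basename(path)] = extract(path)
    return all_texts
```

With textract you would presumably pass something like lambda path: textract.process(path, method='tesseract', language='spa') as extract, reusing the call from your single-file test.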