"String indices must be integers, not str" exception when working with multiple files?

Time: 2016-10-23 17:54:06

Tags: python string python-2.7 python-3.x io

I am extracting text from a directory full of PDFs. For this task I am using Python's textract module:

In:

    for filename in glob.glob(os.path.join(input_directory, '*.pdf')):
        parsed = process(filename, method='tesseract', language='spa')

Out:

---> 31 get_ipython().magic(u'time transform_files(input_d, out_d)')

/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
   2156         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2157         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158         return self.run_line_magic(magic_name, magic_arg_s)
   2159 
   2160     #-------------------------------------------------------------------------

/usr/local/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
   2077                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2078             with self.builtin_trap:
-> 2079                 result = fn(*args,**kwargs)
   2080             return result
   2081 

<decorator-gen-59> in time(self, line, cell, local_ns)

/usr/local/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    186     # but it's overkill for just that one bit of state.
    187     def magic_deco(arg):
--> 188         call = lambda f, *a, **k: f(*a, **k)
    189 
    190         if callable(arg):

/usr/local/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
   1174         if mode=='eval':
   1175             st = clock2()
-> 1176             out = eval(code, glob, local_ns)
   1177             end = clock2()
   1178         else:

<timed eval> in <module>()

<ipython-input-11-ddedab540f65> in transform_files(input_directory, output_directory)
     12 
     13         filename = os.path.basename(filename)
---> 14         texts = parsed['content']
     15         all_texts[filename] = texts
     16 

TypeError: string indices must be integers, not str

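I don't understand why this happens, since the documentation states that filename must be a path, and it is in fact just a path. I also tried testing with a single file, as follows: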

    path = '/pathTo/PDF_FILE.pdf/'
    text_ocr = textract.process(path, method='tesseract', language='spa')

Everything went fine. So my question is: why am I getting TypeError: string indices must be integers, not str, and how do I correctly apply process to filename?

Update

I also tried putting the content into a dictionary:

    parsed = process(filename, method='tesseract', language='spa', encoding='utf8')
    parsed = {"content": parsed}
    filename = os.path.basename(filename)

1 answer:

Answer 0 (score: 0)

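You seem to be chasing a lot of red herrings about the types of variables (such as filename) that have nothing to do with your exception. At the bottom of the traceback, Python tells you exactly where the exception occurred: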

<ipython-input-11-ddedab540f65> in transform_files(input_directory, output_directory)
     12 
     13         filename = os.path.basename(filename)
---> 14         texts = parsed['content']
     15         all_texts[filename] = texts
     16 

TypeError: string indices must be integers, not str

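From the exception message we can infer that parsed is a string, not a dictionary with a 'content' key. Looking at the earlier lines of your code, the parsed variable comes from the call to process. The textract documentation you linked to gives me no reason to expect process to return anything other than a string. Here is the basic example they provide, right at the top of their page: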

    import textract
    text = textract.process('path/to/file.extension')

The variable name text certainly suggests that you get a string!
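The failure is easy to reproduce without textract or any PDFs. The sketch below uses a hard-coded string as a stand-in for whatever process returns (an assumption made purely for illustration) and performs the same lookup as line 14 in the traceback:

```python
# Stand-in for the value returned by process(); this literal is an
# assumption for illustration, any string behaves the same way.
parsed = "some extracted text"

try:
    parsed['content']            # same lookup as line 14 in the traceback
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(type(parsed).__name__)     # the type explains why the lookup fails
print(error_message)
```

Printing the type of parsed right after the call to process is a quick way to confirm what the function actually returned before indexing into it.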

So I think you just need to rewrite your loop:

    for filename in glob.glob(os.path.join(input_directory, '*.pdf')):
        texts = process(filename, method='tesseract', language='spa')
        filename = os.path.basename(filename)
        all_texts[filename] = texts
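For completeness, here is a minimal sketch of how the whole collection step might look as a function. The name collect_texts and the extract parameter are hypothetical; they are not part of textract or of the original transform_files. Passing the extraction function in as an argument simply lets the loop be tried out with a dummy extractor before running OCR on real files:

```python
import glob
import os
import tempfile


def collect_texts(input_directory, extract):
    """Map each PDF's base name to the value returned by extract(path).

    `extract` is a hypothetical hook: any callable that takes a file path
    and returns the extracted text as a string.
    """
    all_texts = {}
    for path in sorted(glob.glob(os.path.join(input_directory, '*.pdf'))):
        # Store the returned string directly; there is no 'content' key to look up.
        all_texts[os.path.basename(path)] = extract(path)
    return all_texts


# Dry run with empty placeholder files and a dummy extractor (no OCR involved).
demo_dir = tempfile.mkdtemp()
for name in ('a.pdf', 'b.pdf', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'w').close()

result = collect_texts(demo_dir, lambda path: 'text of ' + os.path.basename(path))
print(sorted(result))
```

With textract installed, the extractor would presumably be something like lambda path: textract.process(path, method='tesseract', language='spa'), which is the same call used in the question.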