Question

I am using Textract and am relatively new to Python. I would like to load a file as a unicode string rather than UTF-8. Is there a way to do this?

I tried

text = textract.process(file)

but this loads a UTF-8 string, and I would prefer unicode. I tried using

text = textract.process(file, encoding="unicode")

but this raises an error.

Error
Traceback (most recent call last):
  File "/home/moha/dev/intellij-ws/pyqadi/tests/test_file2txt.py", line 11, in test_process
    str=f2t.to_txt(file)
  File "/home/moha/dev/intellij-ws/pyqadi/textsearcher/file2txt.py", line 10, in to_txt
    text = textract.process(file, encoding="unicode")
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/__init__.py", line 57, in process
    return parser.process(filename, encoding, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 46, in process
    return self.encode(unicode_string, encoding)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 31, in encode
    return text.encode(encoding, 'ignore')
LookupError: unknown encoding: unicode

Answer 1

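Textract uses the encoding argument to specify the output encoding (the input encoding is inferred with chardet).

These are the Unicode-related options you can pass as the encoding: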

unicode_escape, unicode_internal, raw_unicode_escape
text = textract.process(file, encoding = 'unicode_escape')

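The exhaustive list is Python's table of standard encodings, since the name is handed straight to the string's encode() method.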
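The traceback in the question ends in text.encode(encoding, 'ignore'), so the encoding name has to be one that Python's codec registry knows. A minimal sketch of that check follows; is_known_codec is a hypothetical helper name used only for illustration, not part of Textract.

```python
import codecs


def is_known_codec(name):
    """Return True if Python's codec registry recognises `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # Same exception type as in the traceback: "unknown encoding: ..."
        return False


for name in ("unicode", "utf-8", "unicode_escape"):
    print("%s: %s" % (name, is_known_codec(name)))

# Note that unicode_escape still produces an encoded byte string
# (ASCII with backslash escapes), not a unicode object.
escaped = u"\u00e9".encode("unicode_escape")
print(repr(escaped))
```

This is why encoding="unicode" fails: "unicode" names a Python 2 type, not a codec. Passing a valid codec name changes how the returned bytes are encoded, but the result is still bytes.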

The underlying data is UTF-8. You can take the result of textract.process as UTF-8 and decode it to Unicode on a separate line:

text = textract.process(file)

Utext = unicode(text,'utf-8')
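The decode step can be wrapped in a small helper that behaves the same on Python 2 and Python 3. This is only a sketch: to_text is a hypothetical name, and the sample bytes stand in for what textract.process(file) would return, so the snippet runs without Textract installed.

```python
def to_text(raw, encoding="utf-8"):
    """Decode a byte string into a text (unicode) string.

    Values that are already text are returned unchanged.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Stand-in for the UTF-8 bytes returned by textract.process(file).
raw = u"caf\u00e9".encode("utf-8")
text = to_text(raw)

# The byte string is longer than the text because the accented
# character takes two bytes in UTF-8.
print(len(raw), len(text))
```

With Textract itself the call would look like to_text(textract.process(file)), assuming the default UTF-8 output described above.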

Answer 2

This simple approach worked for me:

import textract as txt
text = txt.process(file)
text = text.decode("utf8")

How do I load a string as unicode using the Textract library in Python?
