
Textract does not read .doc files containing Greek text

Time: 2018-07-18 21:46:32

Tags: python .doc

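I am trying to read a batch of .doc files in Python so that I can run some text analysis on them (an example of which is attached here). The documents contain non-ASCII Greek characters, and I don't know which encoding MS Word used to store them.

The code I am using: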

import textract
text = textract.process("path_to_sample.doc", extension="doc", encoding="utf_8")

I have tried many different values for the encoding parameter, but none of them worked.
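To make the encoding guesswork more systematic, here is a minimal sketch that tries several candidate encodings on a byte string and keeps the ones that decode without error. It does not call textract at all. The helper name `try_decodings`, the candidate list, and the sample word are assumptions made for illustration; `cp1253` and `iso8859_7` are the usual legacy Greek code pages, and `utf_16_le` is included only as another guess.

```python
# Hypothetical helper, not part of textract: try several encodings on raw bytes.
CANDIDATES = ["utf_8", "cp1253", "iso8859_7", "utf_16_le"]  # assumed candidate list


def try_decodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text} for each candidate that decodes cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; skip it.
            continue
    return results


# Stand-in for bytes taken from a document; the real files may differ.
sample = "Καλημέρα".encode("cp1253")
for name, text in try_decodings(sample).items():
    print(name, repr(text))
```

A clean decode only shows that the bytes are valid in that encoding, not that the result is the intended text, so the output still needs to be checked by eye.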

Any ideas?

0 answers:

No answers yet