Question

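I tried reading a .doc file like this:

with open('file.doc', errors='ignore') as f:
    text = f.read()

It did read the file, but the result contained a lot of junk, and I can't strip that junk because I don't know where it starts and where it ends.

I also tried installing the textract module, which says it can read any file format, but there were many dependency issues when installing it on Windows.

So I did this with the antiword command-line utility instead; my answer is below.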
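
The junk described above is expected: a .doc file is a binary container rather than plain text, so opening it in text mode yields mostly unreadable bytes. As a rough illustration, the sketch below looks at the first bytes of a file to guess its container type. The function name `sniff_word_format` is hypothetical, and the check only identifies the container (OLE2 compound file or ZIP archive), not whether the file really is a Word document.

```python
# Signatures of the two container formats. Assumption: legacy .doc files are
# OLE2 compound files and .docx files are ZIP archives, which is the usual case.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_format(path):
    """Guess the container type of a file from its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:  # binary mode: no text decoding is attempted
        header = f.read(8)
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

A check like this could be used to pick a reader by content instead of trusting the file extension.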

Answer 1

You can use the antiword command-line utility to do this. I know most of you will have tried it already, but I still wanted to share.

Download antiword.
Extract the antiword folder to the C:\ drive and add the path C:\antiword to your PATH environment variable.
Now the Python code:
```
import os
import docx2txt

def get_doc_text(filepath, file):
    """Return the text of a .docx or .doc file located in the folder `filepath`."""
    full_path = os.path.join(filepath, file)
    text = ''
    if file.endswith('.docx'):
        text = docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text; redirect it into a temporary file named <file>x
        temp_file = full_path + 'x'
        if not os.path.exists(temp_file):
            # quote both paths so that names containing spaces still work
            os.system('antiword "{}" > "{}"'.format(full_path, temp_file))
            with open(temp_file) as f:
                text = f.read()
            os.remove(temp_file)  # the temporary file was only needed for reading
        else:
            # a file with the same name and a .docx extension already exists;
            # it is a different document, so we neither overwrite nor read it
            print('Info: a .docx file with the same name as the .doc exists, so it cannot be read')
    return text
```

Now call this function:

```
filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)
```

This could be a good alternative way to read .doc files in Python on Windows.

Hope it helps, thanks.
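
Answer 1 routes antiword's output through a temporary file. A possible variation, shown here only as a sketch, is to capture the output directly with `subprocess.run`. It assumes that antiword is on PATH and prints the document text to standard output, and that the output can be decoded as UTF-8 (which may not hold for every antiword setup). The names `run_and_capture` and `antiword_text` are hypothetical.

```python
import subprocess


def run_and_capture(command):
    """Run a command given as a list of arguments and return its stdout as text."""
    # check=True raises CalledProcessError if the command exits with an error
    result = subprocess.run(command, capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced, not dropped
    return result.stdout.decode("utf-8", errors="replace")


def antiword_text(doc_path, executable="antiword"):
    """Return the plain text that antiword prints for a .doc file."""
    # Passing the arguments as a list avoids quoting problems with spaces in paths
    return run_and_capture([executable, doc_path])
```

With this approach there is no temporary file to clean up, and a failing antiword call surfaces as an exception instead of an empty string.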

Answer 2

Mithilesh's example is good, but once antiword is installed it is simpler to use textract directly. Download antiword and extract the antiword folder to C:\. Then add the antiword folder to your PATH environment variable. Open a new terminal or command console so that the updated PATH is loaded. Install textract with pip install textract.

You can then use textract (which uses antiword for .doc files) like this:

import textract
text = textract.process('filename.doc')
text = text.decode('utf-8')  # convert from bytes to str

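If you get an error, try running the antiword command from a terminal/console to make sure it works. Also make sure the path to the .doc file is correct (for example, check it with os.path.exists('filename.doc')).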
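
The two checks mentioned above can also be run from Python before calling textract. The following is a minimal sketch, not part of textract itself; the function name `find_problems` is hypothetical, and `shutil.which` only tells you whether an executable is found on PATH, not whether it runs correctly.

```python
import os
import shutil


def find_problems(doc_path, tool="antiword"):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    if shutil.which(tool) is None:  # the executable is not visible on PATH
        problems.append("'{}' was not found on PATH".format(tool))
    if not os.path.exists(doc_path):  # the document path is wrong
        problems.append("file not found: {}".format(doc_path))
    return problems
```

For example, `print(find_problems('filename.doc'))` lists whatever needs fixing before the conversion is attempted.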

Reading .doc files in Python using antiword on Windows (also .docx)
