Question

'''This script copies the text in documents (docx) to simple text files

'''

import sys
import ntpath
import os
from docx import Document

docpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\1-100')
txtpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\Textfiles')

for filename in os.listdir(docpath):
    try:
        document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print(filename)
        savetxt = os.path.join(txtpath, ntpath.basename(filename).split('.')[0] + ".txt")
        print('Reading ' + filename)
        # print(savetxt)
        fullText = []
        for para in document.paragraphs:
            # print(para.text)
            fullText.append(para.text)
        with open(savetxt, 'wt') as newfile:
            for item in fullText:
                newfile.write("%s\n" % item)
        # with open(savetxt, 'a') as f:
        # f.write(para.text)
        # print(" ".join([line.rstrip('\n') for line in f]))
        # newfile.write(fullText)
        # newfile.save()
        # newfile.save()
        #
        # newfile.write('\n\n'.join(fullText))
        # newfile.close()

    except:
        # print(filename)
        # document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print('Please fix an error')
        exit()

    # print("Please supply an input and output file. For example:\n"
 # #  "  example-extracttext.py 'My Office 2007 document.docx' 'outp"
 #   "utfile.txt'")

    # Fetch all the text out of the document we just created

    # Make explicit unicode version

    # Print out text of document with two newlines under each paragraph

print(savetxt)

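The Python 3 script above reads Docx files and creates txt files. I have 100 docx files in one directory, but it creates only 19 txt files and then exits. I can't understand why.

The Docx files are output from OCR software and contain only English text (no images, tables, graphics, or anything special).

Today I ran the program again after removing the try/except, and the result is the same: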

1.docx
Reading 1.docx
10.docx
Reading 10.docx
100.docx
Reading 100.docx
11.docx
Reading 11.docx
12.docx
Reading 12.docx
13.docx
Reading 13.docx
14.docx
Reading 14.docx
15.docx
Reading 15.docx
16.docx
Reading 16.docx
17.docx
Reading 17.docx
18.docx
Reading 18.docx
Traceback (most recent call last):
  File "C:\Users\Khairul Basar\Documents\CWD Projects\docx2txtv2.py", line 26, in <module>
    newfile.write("%s\n" % item)
  File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0113' in position 77: character maps to <undefined>

Other posts (here) solve this problem with .encode("utf-8"), but if I use that I get b'my text' on every line, which I don't need.
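A minimal sketch of why calling .encode("utf-8") on each paragraph produces that unwanted wrapper: in Python 3, formatting a bytes object with %s inserts its repr. The sample string here is an assumption made up for illustration; only the character U+0113 comes from the traceback.

```python
# Sample text (hypothetical) containing U+0113, the character named in the traceback.
text = "caf\u0113"

encoded = text.encode("utf-8")       # bytes, not str
line_from_bytes = "%s\n" % encoded   # %s falls back to the bytes repr: b'...'
line_from_str = "%s\n" % text        # formatting the str keeps the text itself

# ascii() keeps the printout safe on consoles that cannot display U+0113.
print(ascii(line_from_bytes))
print(ascii(line_from_str))
```

So encoding each item by hand changes what gets formatted into the line; letting the file object do the encoding avoids that.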

UPDATE: Fixed

I changed the following line: with open(savetxt, 'w', encoding='utf-8') as newfile:

That is, I added encoding='utf-8'.
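As a rough, self-contained illustration of that change, the sketch below writes a list of strings with an explicit UTF-8 encoding and reads them back. The helper name write_paragraphs, the sample strings, and the temporary directory are assumptions for the example; in the real script the strings come from document.paragraphs.

```python
import os
import tempfile


def write_paragraphs(path, paragraphs):
    """Write one paragraph per line, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as newfile:
        for item in paragraphs:
            newfile.write("%s\n" % item)


# Hypothetical sample data standing in for the paragraph texts of a docx file.
paragraphs = ["plain ASCII line", "line with \u0113"]

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "sample.txt")
    write_paragraphs(out_path, paragraphs)
    # Read back with the same encoding to confirm nothing was lost.
    with open(out_path, encoding="utf-8") as f:
        round_trip = f.read().splitlines()

print(round_trip == paragraphs)
```

Without the encoding argument, open() uses the platform's preferred encoding, which on this Windows setup is cp1252 according to the traceback.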

I got help from this post.

Thank you to whoever edited my post so nicely.

Answer 1

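usr2564301 pointed out that I should remove the try/except from the code. After doing that, I got the exact error showing why the program was not working and was exiting prematurely.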
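Removing the try/except is what was actually done here. As an alternative sketch, if the loop should keep going past a bad file, the handler can record the full traceback instead of printing a generic message. The convert function and the file names below are hypothetical stand-ins for the per-file work.

```python
import traceback


def convert(name):
    # Hypothetical stand-in for the per-file conversion; one input fails on purpose.
    if name == "19.docx":
        raise ValueError("simulated failure in " + name)
    return name.replace(".docx", ".txt")


converted, failures = [], []
for name in ["18.docx", "19.docx", "20.docx"]:
    try:
        converted.append(convert(name))
    except Exception:
        # Keep the real error text rather than hiding it behind a fixed message.
        failures.append((name, traceback.format_exc()))

for name, tb in failures:
    print("Failed on " + name)
    print(tb)
```

Catching Exception rather than using a bare except also lets KeyboardInterrupt and SystemExit pass through.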

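The problem was that my Docx files contain many characters outside the 8-bit character set. To write those non-English characters to the text files, encoding='utf-8' is used.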
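The difference between the two encodings can be checked directly on the character from the traceback. This is only a small demonstration, not part of the original script.

```python
ch = "\u0113"  # the character reported in the UnicodeEncodeError

# cp1252 is a single-byte code page and has no slot for U+0113.
try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode any Unicode code point; this one takes two bytes.
utf8_bytes = ch.encode("utf-8")

print(cp1252_ok, utf8_bytes)
```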

That solved the problem.

Anyway, all the credit goes to usr2564301, who pointed me to what I did not know.
