Question

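Every day I receive an email containing a Word document. All of the text in the document sits inside tables. I have hundreds of these Word documents (I get one every day). I want to use Python to open each document, copy the text I need, and paste it into an Excel document. However, I'm stuck on the very first part: I can't extract the text from the Word document. I'm trying to use the python-docx module, but I can't figure out how to read text from the tables.

I modified the getText function from the introductory Python book I'm reading, but it doesn't seem to work. Am I even on the right track?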

import docx
fullText = []

def getText(filename):
    doc = docx.Document(filename)
    for table in doc.Tables:
        for row in table.Rows:
            for cell in row.Cells:
                  fullText.append(cell.text)
    return '\n'.join(fullText)

OK, after looking at this other question, I realized I actually had a different problem than I thought. I made changes and now have the following code:

import docx
fullText = []

doc = docx.Document('c:\\btest\\January18.docx')
for table in doc.tables:
    for row in table.rows:
            for cell in row.cells:
                  fullText.append(cell.text)
'\n'.join(fullText)

print(fullText)

It prints out:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

The problem is that the table cells in the Word document are not blank, so they shouldn't come back empty. What am I doing wrong?

A sample input document is here

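I'm trying to pull certain lines of text out of that document, then paste and format the text the way I need. But I can't even get at the text in the Word document...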

Answer 1

I was able to parse the sample doc and save it to an Excel file with the following script:

import re
import pandas
import docx2txt

INPUT_FILE = 'jantest2.docx'
OUTPUT_FILE = 'jantest2.xlsx'

# Extract all text from the document as a single string
text = docx2txt.process(INPUT_FILE)
# Each record is a case number followed by three fields, separated by blank lines
results = re.findall(r'(\d+-\d+)\n\n(.*)\n\n(.*)\n\n(.*)', text)
data = {'Case Number': [x[0] for x in results],
        'Report Date': [x[1] for x in results],
        'Address': [x[2] for x in results],
        'Statute Description': [x[3] for x in results]}

data_frame = pandas.DataFrame(data=data)
# Write the sheet directly to the output path (no explicit writer or save() needed)
data_frame.to_excel(OUTPUT_FILE, sheet_name='Sheet1', index=False)
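
The regular expression in the script expects each record to start with a case number of the form digits-hyphen-digits, followed by three more fields, each separated by a blank line. A minimal sketch of how it groups fields is below. The sample text and its values are made up for illustration and are only assumed to resemble what docx2txt returns for the real document.

```python
import re

# Hypothetical text, assumed to resemble docx2txt output for two table rows:
# one cell per line, with a blank line between cells.
sample_text = (
    "18-000123\n\n"
    "01/18/2018\n\n"
    "100 Main St\n\n"
    "Theft\n\n"
    "18-000124\n\n"
    "01/18/2018\n\n"
    "200 Oak Ave\n\n"
    "Burglary\n"
)

# Same pattern as in the script: a case number, then three single-line fields.
PATTERN = r'(\d+-\d+)\n\n(.*)\n\n(.*)\n\n(.*)'

# findall returns one tuple per record, with one item per capture group.
records = re.findall(PATTERN, sample_text)
for case_number, report_date, address, statute in records:
    print(case_number, report_date, address, statute)
```

Because `.` does not match a newline, each `(.*)` captures exactly one line, so a field that wraps onto several lines would need a different pattern.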

So this is what I got in the Excel file:

[Screenshot: text from the Word table]
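
One possible reason for the empty strings in the question is that the cell text sits inside a wrapper such as a content control (`w:sdt`) rather than in paragraphs directly under the cell. That is an assumption and has not been checked against the sample document. As a rough sketch of a per-cell alternative, a .docx file is a ZIP archive whose main body is `word/document.xml`, so the standard library alone can walk every table cell and collect all of its text runs. The helper names `table_cell_texts` and `build_demo_docx` are made up for this example, and the demo document is a stripped-down stand-in rather than a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's "{uri}tag" notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def table_cell_texts(docx_file):
    """Return the text of every table cell, including text nested in wrappers.

    Text runs inside one cell are concatenated without separators. For nested
    tables, an outer cell's text also contains the text of its inner cells.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for cell in root.iter(W + "tc"):
        # iter() descends through any wrappers (such as w:sdt) below the cell
        texts.append("".join(t.text or "" for t in cell.iter(W + "t")))
    return texts


def build_demo_docx():
    """Build a tiny in-memory stand-in for a .docx (hypothetical content)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body><w:tbl><w:tr>'
        '<w:tc><w:p><w:r><w:t>18-000123</w:t></w:r></w:p></w:tc>'
        '<w:tc><w:sdt><w:sdtContent><w:p><w:r><w:t>100 Main St</w:t></w:r>'
        '</w:p></w:sdtContent></w:sdt></w:tc>'
        '</w:tr></w:tbl></w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


cells = table_cell_texts(build_demo_docx())
print(cells)
```

With a real file, `table_cell_texts('jantest2.docx')` would return a flat list of cell strings, which could then be grouped into rows before being written to Excel.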
