Question

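I'm new to Python and have done a little hands-on work with the python-docx module. I need to read a Word document that contains multiple tables and text. In this document I have to pick one specific table to read, where the choice depends on the text written in the line just above the table, and then process that table's data.

I can read table data by referring to a table by its index, but in this case the table index is unknown and the table can be anywhere in the document. The only thing I can use to identify the table is the text written in the line just above it.

Could you help me achieve this?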

Answer 1

I have a solution that uses BeautifulSoup rather than python-docx. What I do here is traverse the OOXML of the Word (.docx) document.

from bs4 import BeautifulSoup
import zipfile

wordoc = input('Enter your file name here or name with path: ')
text1 = 'Enter your text written above the table'
# Strip all whitespace so the comparison ignores spacing differences
text1 = ''.join(text1.split())

# A .docx file is a zip archive; the main body lives in word/document.xml
with zipfile.ZipFile(wordoc) as archive:
    xml_content = archive.read('word/document.xml')
soup = BeautifulSoup(xml_content, 'xml')  # the 'xml' parser requires lxml

for document in soup.children:
    for body in document.children:
        for tag in body.children:
            if tag.name == 'p' and (''.join(tag.text.split())) == text1:
                # Next table among the siblings that follow the matching paragraph
                table = tag.find_next_sibling('w:tbl')
                if table is None:
                    continue  # no table follows this paragraph
                table_contents = []
                for wtc in table.find_all('w:tc'):
                    cell_text = ''
                    for wr in wtc.find_all('w:r'):
                        # We want to exclude struck-through text
                        if not wr.find_all('w:strike'):
                            cell_text += wr.text
                    table_contents.append(cell_text)
                print(table_contents)
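
For comparison, here is a minimal sketch of the same idea using only the standard library (zipfile and xml.etree.ElementTree) instead of BeautifulSoup. The helper names (normalize, element_text, table_after_heading) are hypothetical and chosen only for illustration. It assumes the heading paragraph and the table are both direct children of w:body, and it only accepts a table that comes immediately after the matching paragraph. Unlike the code above, it groups cells by row and does not skip struck-through runs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def normalize(text):
    """Drop all whitespace so the comparison ignores spacing differences."""
    return "".join(text.split())


def element_text(element):
    """Concatenate the text of every w:t node found under the element."""
    return "".join(t.text or "" for t in element.iter(W + "t"))


def table_after_heading(docx_file, heading):
    """Return the rows (lists of cell strings) of the table that directly
    follows the paragraph whose text equals `heading`, or None if no such
    paragraph/table pair exists. Hypothetical helper, not a library API."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    if body is None:
        return None
    wanted = normalize(heading)
    blocks = list(body)  # top-level paragraphs and tables, in document order
    for current, following in zip(blocks, blocks[1:]):
        if (current.tag == W + "p"
                and following.tag == W + "tbl"
                and normalize(element_text(current)) == wanted):
            return [
                [element_text(cell) for cell in row.findall(W + "tc")]
                for row in following.findall(W + "tr")
            ]
    return None
```

A call such as table_after_heading('report.docx', 'Sales data') would give back one list per table row; the file name and heading here are placeholders. Requiring the table to be the very next block is a deliberate choice, so that a paragraph matching the heading with no table right after it is not paired with some unrelated table further down.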
