Question

I have a bunch of Word docx files that all contain the same embedded Excel table. I am trying to extract the same cell from several of these files.

I figured out how to hard-code it for a single file:

from docx import Document

document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx\006-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text

print(Project)

But how do I batch this? I tried a few variations of listdir, but they didn't work for me, and I'm too green to get there on my own.

Answer 1

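How you iterate over all the files really depends on your project deliverables. Are all the files in a single folder? Are there files other than .docx files in it?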

To cover all the bases, we'll assume there are subdirectories, as well as other files mixed in with the .docx files. For this, we'll use os.walk() and os.path.splitext().

import os

from docx import Document

# First, we'll create an empty list to hold the path to all of your docx files
document_list = []       

# Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
# (and all its subfolders) using os.walk().  You could alternatively use os.listdir()
# to get a list of files.  That would be recommended, and simpler, if all files are
# in the same folder.  Consider that change a small challenge for developing your skills!
for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"): 
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding
        #  it to our list
        if os.path.splitext(os.path.join(path, name))[1] == ".docx":
            document_list.append(os.path.join(path, name))

# Now create a loop that goes over each file path in document_list, replacing your 
# hard-coded path with the variable.
for document_path in document_list:
    document = Document(document_path)        # Change the document being loaded each loop
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    project = paragraph.text

    print(project)

For the alternative approach, see the documentation on os.listdir().
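
For the single-folder case, a minimal sketch using os.listdir() might look like the following. The function name list_docx_files is made up for illustration, and the "~$" check is an extra assumption not in the original answer: Word leaves lock files with that prefix next to any document that is currently open.

```python
import os


def list_docx_files(folder):
    """Return sorted full paths of the .docx files directly inside `folder`."""
    paths = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        # os.listdir() also returns subfolders, so keep regular files only.
        if not os.path.isfile(full_path):
            continue
        # Assumption: skip Word's temporary lock files such as "~$report.docx".
        if name.startswith("~$"):
            continue
        # lower() lets ".DOCX" match as well as ".docx".
        if os.path.splitext(name)[1].lower() == ".docx":
            paths.append(full_path)
    return sorted(paths)
```

The returned list could stand in for document_list in the loop above. Unlike os.walk(), this does not descend into subfolders.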

Also, it would be best to put your code into a reusable function, but that is a challenge for you as well!
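
One possible shape for such a function is sketched below. The names read_cell_text and read_cell_from_files, and the idea of passing the loader in as an opener argument, are illustrative choices rather than anything from python-docx; the default indexes mirror the rows[2].cells[2] lookup in the question.

```python
def read_cell_text(document, table_index=0, row_index=2, col_index=2):
    """Return the text of the first paragraph in one table cell.

    `document` is expected to be a python-docx Document, or any object
    exposing the same .tables / .rows / .cells / .paragraphs attributes.
    """
    table = document.tables[table_index]
    cell = table.rows[row_index].cells[col_index]
    return cell.paragraphs[0].text


def read_cell_from_files(paths, opener, **cell_kwargs):
    """Map each path to its cell text, using `opener` to load each file."""
    results = {}
    for path in paths:
        results[path] = read_cell_text(opener(path), **cell_kwargs)
    return results
```

With python-docx installed, this could be called as read_cell_from_files(document_list, Document), where Document is imported from docx as in the code above.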

Answer 2

Assuming the code above gives you the data you need, all that is left is to read the files from disk and process them.

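First let's define a function that does what you are already doing, then we'll loop over all the documents in the directory and process each one with that function.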

Edit the following untested code to suit your needs.

# we'll use os.walk to list the files in the directory
# we're going to make some simplifying assumptions:

# 1) all the docx files are in the same directory
# 2) you want to store the paragraph text in a list.

import os

from docx import Document


DOCS = r'G:\GIS\DESIGN\ROW\ROW_Files\Docx'

def get_para(document):
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    return paragraph.text

if __name__ == "__main__":
    paragraphs = []
    # next() takes the first (top-level) result of os.walk: (path, subdirs, files)
    _, _, filenames = next(os.walk(DOCS))
    for filename in filenames:
        # skip anything that is not a .docx file
        if os.path.splitext(filename)[1] != ".docx":
            continue
        file_name = os.path.join(DOCS, filename)
        document = Document(file_name)
        paragraphs.append(get_para(document))

    print(paragraphs)
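
In a batch run, a single file that lacks the expected table would stop the whole loop with an IndexError. A hedged sketch of a more tolerant lookup follows; safe_cell_text is a hypothetical helper name, and returning None for a missing cell is just one possible convention.

```python
def safe_cell_text(document, row_index=2, col_index=2):
    """Return the cell text, or None if the document lacks that table cell."""
    try:
        cell = document.tables[0].rows[row_index].cells[col_index]
        return cell.paragraphs[0].text
    except IndexError:
        # Missing table, row, or cell: report "no value" instead of
        # letting one odd file abort the whole batch.
        return None
```

It could be swapped in for get_para, with the caller deciding whether to skip or log files that come back as None.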

Search a directory for all docx files with python-docx (batch processing)
