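Using python-docx

Time: 2016-11-17 21:43:09

Tags: metadata python-docx

I have about 300 docx files in a folder and its subfolders, and I need to update their metadata. I have a separate csv file of 300+ rows that holds the metadata: each row contains a filename, keywords, and a title.

I want to loop through the docx files, pull the values from the csv, and insert the metadata into each docx file. The docx files are stored in 2 subfolders under a root folder.

So far I've drafted the following. What I'm struggling with is how to iterate over the csv file and apply the metadata to each file in turn. I'm sure there is a relatively simple way to do this; setting up the loop and getting at the csv contents is where I get lost. I'm a noob, feeling my way as I go.

Any tips appreciated.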

#running in python 3.5.2 32bit
import csv
from docx import Document
import os
import sys

csv_path = ("datasheet_metadata_uplift.csv")

def update_docx_metadata(document, keywords, title):
    """
    Update the *keywords*, and *title* metadata
    properties in *document*.
    """
    core_properties = document.core_properties
    core_properties.keywords = keywords
    core_properties.title = title

def read_csv_lines(filename, keywords, title):
    """
    Reads the csv lines, returns *filename*, *keywords*, *title*
    """
    with open(csv_path, 'r') as f:
        csv_file = csv.reader(f)
        for row in csv_file:
            filename = row[0]
            keywords = row[1]
            title = row[2]

def open_docx(filename):
     """
     Search for docx file and open it 
     """
     for root, dirs, files in os.walk("."):
         if filename in files:
            doc_path = os.path.join(path, filename)

csv_lines = read_csv_lines(filename, keywords, title)
for filename, keywords, title in csv_lines:
    document = Document(doc_path)
    update_doc_metadata(filename, keywords, title)
    document.save(doc_path)

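2 Answers:

Answer 0: (score: 0)

The next step I'd recommend, Aidan, is refactoring your code into cohesive functions. That lets you do what you need, when you need it, with one function call per operation, so the intent and flow aren't obscured.

You might start with something like this: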

def update_doc_metadata(document, author, keywords, title, subject):
    """
    Update the *author*, *keywords*, *title*, and *subject* metadata
    properties in *document*.
    """
    core_properties = document.core_properties
    core_properties.author = author
    core_properties.keywords = keywords
    core_properties.title = title
    core_properties.subject = subject

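Note a few things about it:

  • It is cohesive, meaning it does only one thing. That makes it more reusable.
  • It doesn't depend on anything that doesn't come in as a parameter. That makes it easy to test (if you do that) and generally easy to understand, because all the context you need is in those ten lines.
  • It has a docstring that states clearly what it does. This is a useful discipline, not only because it helps the reader (quite possibly you, weeks or months later) understand the intent, but because it forces you to explain what you are doing. Often you can detect poor factoring because it is hard or awkward to explain. (The asterisks around the parameters make them appear in italics in some documentation packages.)

If you keep doing this, locating cohesive bits and "extracting" them into functions, the core logic of the main code becomes much clearer.

I'd expect the overall structure to be something like this: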
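The testability point can be illustrated with a small sketch. Because the function only touches `document.core_properties`, it can be exercised with a stand-in object instead of a real docx file. The `make_fake_document` helper and the sample values are hypothetical, added only for illustration; they are not part of python-docx.

```python
from types import SimpleNamespace


def update_doc_metadata(document, author, keywords, title, subject):
    """
    Update the *author*, *keywords*, *title*, and *subject* metadata
    properties in *document*.
    """
    core_properties = document.core_properties
    core_properties.author = author
    core_properties.keywords = keywords
    core_properties.title = title
    core_properties.subject = subject


def make_fake_document():
    """Hypothetical stand-in that exposes only a core_properties attribute."""
    return SimpleNamespace(core_properties=SimpleNamespace())


# Everything the function needs arrives through its parameters,
# so no file on disk is required to check its behavior.
fake = make_fake_document()
update_doc_metadata(fake, "Aidan", "pump; valve", "Datasheet 1", "Datasheets")
print(fake.core_properties.title)  # prints "Datasheet 1"
```

With a real file, the same call would receive the object returned by `Document(path)` instead of the stand-in.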

csv_lines = read_csv_lines(csv_path)
for filename, keywords, title in csv_lines:
    doc_path, document = open_docx(filename)
    update_doc_metadata(document, author, keywords, title, subject)
    document.save(doc_path)
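The two helpers that this outline calls are not spelled out in the answer, so here is one possible sketch of them. It assumes the csv columns are filename, keywords, title (as described in the question) and that file names are unique across the subfolders. `find_docx_path` and the `root` parameter are hypothetical additions for illustration.

```python
import csv
import os


def read_csv_lines(csv_path):
    """Return a list of (filename, keywords, title) tuples read from the csv."""
    with open(csv_path, newline="") as f:
        # Skip blank rows and keep only the first three columns.
        return [tuple(row[:3]) for row in csv.reader(f) if row]


def find_docx_path(filename, root="."):
    """Return the path of the first file named *filename* under *root*, or None."""
    for dirpath, dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


def open_docx(filename, root="."):
    """Return (doc_path, document) for *filename* found under *root*."""
    # python-docx is imported here so the two helpers above can be used
    # (and tested) without it installed.
    from docx import Document

    doc_path = find_docx_path(filename, root)
    if doc_path is None:
        raise FileNotFoundError(filename)
    return doc_path, Document(doc_path)
```

Returning the path together with the document lets the main loop save back to the same location. A row with fewer than three columns would fail when the loop unpacks it, which may be preferable to silently writing incomplete metadata.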

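Answer 1: (score: 0)

So I figured this out, and it turned out to be quite simple. I also made it easier on myself by putting the full file path in the csv. Thanks to scanny for the encouragement. Next stop, the documentation and tutorial pages :)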

#runs in python 3.5.2 32-bit
#docx requires 32 bit operation
import csv
from docx import Document
import os
import sys

#path to the csv file - each row must contain, in this order:
#full filepath, title, keywords
#ensure there are no commas, other than the csv delimiters

csv_path = "datasheet_metadata_uplift.csv"

#read the csv one row at a time and update the docx file named in each row
with open(csv_path, newline="") as f:
    for row in csv.reader(f):
        filename, title, keywords = row[0], row[1], row[2]
        #update the docx file for this csv entry
        document = Document(filename)
        core_properties = document.core_properties
        core_properties.keywords = keywords
        core_properties.title = title
        core_properties.subject = "YOUR_SUBJECT_HERE"
        core_properties.comments = " "
        #company is an extended property rather than a core one, so
        #python-docx probably does not save this value
        core_properties.company = "YOUR_COMPANY_HERE"
        document.save(filename)

print ("finished!")
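The full file paths in the csv do not have to be entered by hand. As a rough sketch, assuming the original csv holds a bare file name in its first column and that file names are unique under the root folder, the paths could be filled in like this. The names `index_docx_files` and `write_full_path_csv` are hypothetical helpers, not part of python-docx.

```python
import csv
import os


def index_docx_files(root):
    """Map each .docx file name under *root* to its full path."""
    index = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".docx"):
                # Keep the first path seen if a name occurs more than once.
                full_path = os.path.abspath(os.path.join(dirpath, name))
                index.setdefault(name, full_path)
    return index


def write_full_path_csv(src_csv, dst_csv, root):
    """
    Copy *src_csv* to *dst_csv*, replacing the first column with the
    full path. Returns the file names that were not found under *root*.
    """
    index = index_docx_files(root)
    missing = []
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            path = index.get(row[0])
            if path is None:
                missing.append(row[0])
                continue
            # Remaining columns are copied through in their original order.
            writer.writerow([path] + row[1:])
    return missing
```

Walking the folder tree once and looking names up in a dictionary avoids a separate `os.walk` per csv row, and the returned list shows which rows had no matching file.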