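Python docx: replacing a string in a paragraph while keeping the style

Time: 2016-01-14 00:39:18

Tags: python python-2.7 python-docx

I need help replacing a string in a Word document while keeping the formatting of the entire document.

I'm using python-docx. From reading the documentation, it works on whole paragraphs, so I lose formatting such as bold or italic words. The text to be replaced is itself bold, and I would like to keep it that way. I'm using this code: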

from docx import Document
def replace_string2(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'Text to find and replace' in p.text:
            print('SEARCH FOUND!!')
            text = p.text.replace('Text to find and replace', 'new text')
            style = p.style
            p.text = text
            p.style = style
    # doc.save(filename)
    doc.save('test.docx')
    return 1

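So if I run it on a document like the following, the paragraph containing the string to be replaced loses its formatting:

This is paragraph 1, and this is text in bold.

This is paragraph 2, and I will replace old text

The current result is:

This is paragraph 1, and this is text in bold.

This is paragraph 2, and I will replace new text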

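4 answers:

Answer 0: (score: 10)

I posted this question (even though I saw several similar ones here) because none of them, as far as I can tell, solved the problem. One used the oodocx library, which I tried, but it did not work. So I found a workaround.

The code is very similar, but the logic is: when I find a paragraph that contains the string I want to replace, I add another loop over its runs. (This only works if the whole string I want to replace has the same formatting, i.e. it sits inside a single run.)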

from docx import Document

def replace_string(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'old text' in p.text:
            # Work on runs (spans of text sharing the same character formatting)
            # instead of the whole paragraph, so each run keeps its style.
            for run in p.runs:
                if 'old text' in run.text:
                    run.text = run.text.replace('old text', 'new text')
            print(p.text)

    doc.save('dest1.docx')
    return 1
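
The caveat about formatting can be seen without a Word file. The following is a minimal sketch that models a paragraph's runs as a plain list of strings; that model and the helper name `replace_per_run` are assumptions made purely for illustration and are not part of python-docx.

```python
def replace_per_run(run_texts, old, new):
    """Apply str.replace to each run's text independently, as the loop above does."""
    return [text.replace(old, new) for text in run_texts]


# The whole search string sits inside one run: the replacement succeeds.
single_run = ["This is paragraph 2, and I will replace ", "old text"]
print(replace_per_run(single_run, "old text", "new text"))

# The search string is split across two runs (for example, only "old" is bold):
# no single run contains "old text", so nothing is replaced.
split_runs = ["This is paragraph 2, and I will replace ", "old", " text"]
print(replace_per_run(split_runs, "old text", "new text"))
```

In the second case `p.text` would still contain 'old text', so the outer `if` passes, but the inner loop finds no run to change.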

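Answer 1: (score: 2)

This is how I keep the text style when replacing text.

Based on Alo's answer and the fact that the search text can be split across several runs, this is what worked for me to replace placeholder text in a template docx file. It checks all of the document's paragraphs and the contents of all table cells for the placeholders.

Once the search text is found in a paragraph, it loops through the runs to identify which runs contain part of the search text, then inserts the replacement text into the first of those runs and blanks out the remaining search-text characters in the others.

I hope this helps someone. Here is the gist, in case anyone wants to improve it.

Edit: I have since discovered python-docx-template, which allows jinja2-style templating within a docx template. Here is a link to the documentation.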

def docx_replace(doc, data):
    paragraphs = list(doc.paragraphs)
    for t in doc.tables:
        for row in t.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    paragraphs.append(paragraph)
    for p in paragraphs:
        for key, val in data.items():
            key_name = '${{{}}}'.format(key) # I'm using placeholders in the form ${PlaceholderName}
            if key_name in p.text:
                inline = p.runs
                # Replace strings and retain the same style.
                # The text to be replaced can be split over several runs so
                # search through, identify which runs need to have text replaced
                # then replace the text in those identified
                started = False
                key_index = 0
                # found_runs is a list of (inline index, index of match, length of match)
                found_runs = list()
                found_all = False
                replace_done = False
                for i in range(len(inline)):

                    # case 1: found in single run so short circuit the replace
                    if key_name in inline[i].text and not started:
                        found_runs.append((i, inline[i].text.find(key_name), len(key_name)))
                        text = inline[i].text.replace(key_name, str(val))
                        inline[i].text = text
                        replace_done = True
                        found_all = True
                        break

                    if key_name[key_index] not in inline[i].text and not started:
                        # keep looking ...
                        continue

                    # case 2: search for partial text, find first run
                    if key_name[key_index] in inline[i].text and inline[i].text[-1] in key_name and not started:
                        # check sequence
                        start_index = inline[i].text.find(key_name[key_index])
                        check_length = len(inline[i].text)
                        for text_index in range(start_index, check_length):
                            if inline[i].text[text_index] != key_name[key_index]:
                                # no match so must be false positive
                                break
                        if key_index == 0:
                            started = True
                        chars_found = check_length - start_index
                        key_index += chars_found
                        found_runs.append((i, start_index, chars_found))
                        if key_index != len(key_name):
                            continue
                        else:
                            # found all chars in key_name
                            found_all = True
                            break

                    # case 2: search for partial text, find subsequent run
                    if key_name[key_index] in inline[i].text and started and not found_all:
                        # check sequence
                        chars_found = 0
                        check_length = len(inline[i].text)
                        for text_index in range(0, check_length):
                            if inline[i].text[text_index] == key_name[key_index]:
                                key_index += 1
                                chars_found += 1
                            else:
                                break
                        # no match so must be end
                        found_runs.append((i, 0, chars_found))
                        if key_index == len(key_name):
                            found_all = True
                            break

                if found_all and not replace_done:
                    for i, item in enumerate(found_runs):
                        index, start, length = item
                        if i == 0:
                            text = inline[index].text.replace(inline[index].text[start:start + length], str(val))
                            inline[index].text = text
                        else:
                            text = inline[index].text.replace(inline[index].text[start:start + length], '')
                            inline[index].text = text
                # print(p.text)

# usage
import docx

doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more"))
doc.save('path/to/destination.docx')
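
The same strategy (put the replacement in the run where the match starts, blank the matched characters in the following runs) can be sketched more compactly on plain strings. This is an illustrative sketch under stated assumptions, not the author's gist: run texts are modelled as a list of strings, and `replace_across_runs` is a hypothetical helper name.

```python
def replace_across_runs(run_texts, old, new):
    """Replace every occurrence of `old` in the concatenated run texts.

    The replacement goes into the run where a match starts; the matched
    characters in any following runs are removed.
    """
    texts = list(run_texts)
    if not old:
        return texts
    search_from = 0
    while True:
        full = "".join(texts)
        start = full.find(old, search_from)
        if start == -1:
            return texts
        end = start + len(old)
        offset = 0      # position of the current run within `full`
        first = True    # True until the run holding the match start is handled
        for i, text in enumerate(texts):
            run_start, run_end = offset, offset + len(text)
            offset = run_end
            lo, hi = max(start, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run holds no part of the match
            insert = new if first else ""
            first = False
            texts[i] = text[:lo - run_start] + insert + text[hi - run_start:]
        # Resume after the inserted text so that `new` is never rescanned.
        search_from = start + len(new)


runs = ["Hello ${Na", "me}", ", welcome"]
print(replace_across_runs(runs, "${Name}", "Alice"))

# With python-docx the result could be written back like this (not run here):
#     new_texts = replace_across_runs([r.text for r in p.runs], old, new)
#     for run, text in zip(p.runs, new_texts):
#         if run.text != text:
#             run.text = text
```

Only the `text` of each run is reassigned, so the character formatting of every run stays as it was; the replacement takes on the formatting of the run in which the match begins.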

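Answer 2: (score: 2)

According to the structure of a DOCX document:

  1. Body text: document > paragraph > run
  2. Body tables: document > table > row > col > cell > paragraph > run
  3. Header text: document > sections > header > paragraph > run
  4. Header tables: document > sections > header > table > row > col > cell > paragraph > run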
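
The hierarchy above can be walked with a few small generators. This is a rough sketch that relies only on the python-docx attribute names `paragraphs`, `tables`, `rows`, `cells`, `sections`, `header` and `footer`; the function names are hypothetical, and the stand-in objects at the end exist only so that the sketch runs without a .docx file.

```python
from types import SimpleNamespace as NS


def iter_table_paragraphs(table):
    """Yield the paragraphs of every cell in a table, including nested tables."""
    for row in table.rows:
        for cell in row.cells:
            yield from cell.paragraphs
            for nested in cell.tables:
                yield from iter_table_paragraphs(nested)


def iter_container_paragraphs(container):
    """Yield paragraphs of a document body, header or footer, then those in its tables."""
    yield from container.paragraphs
    for table in container.tables:
        yield from iter_table_paragraphs(table)


def iter_all_paragraphs(doc):
    """Yield body paragraphs first, then each section's header and footer paragraphs."""
    yield from iter_container_paragraphs(doc)
    for section in doc.sections:
        yield from iter_container_paragraphs(section.header)
        yield from iter_container_paragraphs(section.footer)


# Stand-in objects that mimic the attribute layout (strings play the role of paragraphs).
cell = NS(paragraphs=["cell paragraph"], tables=[])
table = NS(rows=[NS(cells=[cell])])
header = NS(paragraphs=["header paragraph"], tables=[])
footer = NS(paragraphs=["footer paragraph"], tables=[])
doc = NS(paragraphs=["body paragraph"], tables=[table],
         sections=[NS(header=header, footer=footer)])
print(list(iter_all_paragraphs(doc)))
```

With a real document, `doc` would be the object returned by `Document(path)`, and each yielded item would be a paragraph whose `runs` can then be searched.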

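Footers work the same way as headers. We could traverse the paragraphs directly to find and replace our keywords, but that resets the text formatting, so we can only traverse the text inside the runs and replace it there. However, because a keyword may extend beyond the boundary of a single run, replacing run by run does not always succeed.

So here is an idea: first, taking the paragraph as the unit, use a list to record the position of every character in the paragraph; then record, for each character in that list, which run it belongs to and where it sits inside that run; finally, find the keywords in the paragraph and use that mapping to delete and replace them character by character.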

'''
-*- coding: utf-8 -*-
@Time    : 2021/4/19 13:13
@Author  : ZCG
@Site    : 
@File    : Batch DOCX document keyword replacement.py
@Software: PyCharm
'''

from docx import Document
import os
import tqdm

def get_docx_list(dir_path):
    '''
    :param dir_path: directory to search (subdirectories included)
    :return: List of docx files found under dir_path
    '''
    file_list = []
    for path, dirs, files in os.walk(dir_path):
        for file in files:
            # Locate the docx documents and exclude Word's temporary "~" files
            if file.endswith(".docx") and not file.startswith("~"):
                file_list.append(os.path.join(path, file))
    print("The directory found a total of {0} related files!".format(len(file_list)))
    return file_list

class ParagraphsKeyWordsReplace:
    '''
        self:paragraph
    '''
    def paragraph_keywords_replace(self,x,key,value):
        '''
        :param x:  paragraph index
        :param key: Key words to be replaced
        :param value: Replace the key words
        :return:
        '''
        keywords_list = [s for s in range(len(self.text)) if self.text.find(key, s) == s] # Retrieve the number of occurrences of the Key in this paragraph and record the starting position in the List
        # there if use: while self.text.find(key) >= 0,When {"ab":" ABC "} is encountered, it will enter an infinite loop
        while len(keywords_list)>0:             #If this paragraph contains more than one key, you need to iterate
            index_list = [] #Gets the index value for all characters in this paragraph
            for y, run in enumerate(self.runs):  # Read the index of run
                for z, char in enumerate(list(run.text)):  # Read the index of the chars in the run
                    position = {"run": y, "char": z}  # Give each character a dictionary index
                    index_list.append(position)
            # print(index_list)
            start_i = keywords_list.pop()   # Fetch the starting position containing the key from the back to the front of the list
            end_i = start_i + len(key)      # Determine where the key word ends in the paragraph
            keywords_index_list = index_list[start_i:end_i]  # Intercept the section of a list that contains keywords in a paragraph
            # print(keywords_index_list)
            # return keywords_index_list    #Returns a list of coordinates for the chars associated with keywords
            ParagraphsKeyWordsReplace.character_replace(self, keywords_index_list, value)
            # print(f"Successful replacement:{key}===>{value}")

    def character_replace(self,keywords_index_list,value):
        '''
        :param keywords_index_list: A list of index dictionaries for the characters of one keyword match
        :param value: The new word after the replacement
        :return:
        Delete the matched characters from back to front, then replace the first one with value.
        Note: working from back to front keeps the remaining character indexes valid.
        '''
        while len(keywords_index_list) > 0:
            position = keywords_index_list.pop()    # Remove the last element and return it
            y = position["run"]
            z = position["char"]
            run = self.runs[y]
            if len(keywords_index_list) > 0:
                # Delete only the character at index z
                # (str.replace would remove every equal character in the run)
                run.text = run.text[:z] + run.text[z + 1:]
            else:
                # Replace the first matched character with the new value
                run.text = run.text[:z] + value + run.text[z + 1:]

class DocxKeyWordsReplace:
    '''
        self:docx
    '''
    def content(self,replace_dict):
        print("Please wait for a moment, the body content is processed...")
        for key, value in tqdm.tqdm(replace_dict.items()):
            for x,paragraph in enumerate(self.paragraphs):
                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph,x,key,value)

    def tables(self,replace_dict):
        print("Please wait for a moment, the body tables is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for i,table in enumerate(self.tables):
                for j,row in enumerate(table.rows):
                    for cell in row.cells:
                        for x,paragraph in enumerate(cell.paragraphs):
                            ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph,x,key,value)

    def header_content(self,replace_dict):
        print("Please wait for a moment, the header body content is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for i,sections in enumerate(self.sections):
                for x,paragraph in enumerate(self.sections[i].header.paragraphs):
                    ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def header_tables(self,replace_dict):
        print("Please wait for a moment, the header body tables is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                for table in section.header.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

    def footer_content(self, replace_dict):
        print("Please wait for a moment, the footer body content is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for i, sections in enumerate(self.sections):
                for x, paragraph in enumerate(self.sections[i].footer.paragraphs):
                    ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)


    def footer_tables(self, replace_dict):
        print("Please wait for a moment, the footer body tables is processed...")
        for key,value in tqdm.tqdm(replace_dict.items()):
            for section in self.sections:
                for table in section.footer.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                ParagraphsKeyWordsReplace.paragraph_keywords_replace(paragraph, x, key, value)

def main():
    '''
    How to use it: Modify the values in replace_dict and file_dir
    Replace_dict: The following dictionary corresponds to the format, with key as the content to be replaced and value as the new content
    File_dir: The directory where the docx file resides. Supports subdirectories
    '''
    # Input part
    replace_dict = {
        "MG life technology (shenzhen) co., LTD":"Shenzhen YW medical technology co., LTD",
        "MG-":"YW-",
        "2017-":"2020-",
        "Z18":"Z20",

        }
    # A raw string literal cannot end with a single backslash, so the trailing one is dropped
    file_dir = r"D:\Working Files\SVN"
    # Call processing part
    for i,file in enumerate(get_docx_list(file_dir),start=1):
        print(f"{i}、Files in progress:{file}")
        docx = Document(file)
        DocxKeyWordsReplace.content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.tables(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.header_content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.header_tables(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.footer_content(docx, replace_dict=replace_dict)
        DocxKeyWordsReplace.footer_tables(docx, replace_dict=replace_dict)
        docx.save(file)
        print("This document has been processed!\n")

if __name__ == "__main__":
    main()
    print("All complete processing!")
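
The character-position idea behind this answer can be illustrated on plain strings, without a Word file. This is a minimal sketch, assuming run texts are modelled as a list of strings; `build_char_map` and `replace_by_char_map` are hypothetical names, and unlike the script above the sketch collects non-overlapping matches only.

```python
def build_char_map(run_texts):
    """Return one (run_index, char_index) pair per character of the joined text."""
    return [(r, c) for r, text in enumerate(run_texts) for c in range(len(text))]


def replace_by_char_map(run_texts, key, value):
    """Replace `key` with `value` even when `key` spans several runs."""
    texts = list(run_texts)
    if not key:
        return texts
    full = "".join(texts)
    char_map = build_char_map(texts)
    # Collect non-overlapping match positions, left to right.
    starts = []
    pos = full.find(key)
    while pos != -1:
        starts.append(pos)
        pos = full.find(key, pos + len(key))
    # Work from the last match to the first so earlier indexes stay valid.
    for start in reversed(starts):
        span = char_map[start:start + len(key)]
        for n, (r, c) in reversed(list(enumerate(span))):
            # The first matched character becomes `value`; the others are deleted.
            insert = value if n == 0 else ""
            texts[r] = texts[r][:c] + insert + texts[r][c + 1:]
    return texts


print(replace_by_char_map(["MG", "-2017-", "A"], "MG-", "YW-"))
```

Because every edit touches a position at or after the characters still to be processed, the map built once from the original texts remains usable for the whole pass.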

Answer 3: (score: 0)

from docx import Document

document = Document('old.docx')

dic = {'name':'ahmed','me':'zain'}
for p in document.paragraphs:
    inline = p.runs
    for i in range(len(inline)):
        text = inline[i].text
        if text in dic.keys():
            text=text.replace(text,dic[text])
            inline[i].text = text

document.save('new.docx')