Python docx: find a word and replace it with italic text

Date: 2020-06-07 21:49:19

Tags: python python-docx

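I've thought of a few ways to do this, but each one is uglier than the next. I'm trying to find a way to search a Word document for every instance of a word and italicize it.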

I can't upload the Word document, but what I have in mind looks like this:

[image not included]

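A working example would find every instance of billybob (including those in tables) and italicize it. The problem is that Word often splits text into runs at arbitrary points, so one run may contain billy and the next may contain bob. There is therefore no straightforward way to find every occurrence by looking at runs one at a time.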
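To make the problem concrete, here is a minimal sketch that uses plain strings in place of python-docx runs. The run boundaries are made up for illustration; in a real document they would come from `[run.text for run in paragraph.runs]`.

```python
# Hypothetical run texts, standing in for [run.text for run in paragraph.runs]
run_texts = ["Hello ", "billy", "bob", "!"]
word = "billybob"

# Searching run by run misses the word, because it straddles two runs
found_in_single_run = any(word in text for text in run_texts)

# Searching the concatenated text finds it
found_in_joined_text = word in "".join(run_texts)
```

So the search has to happen on the joined text, and the match then has to be mapped back to the runs it covers.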

1 Answer:

Answer 0: (score: 0)

I'm going to leave this open, because the method I came up with isn't perfect, but it works in most cases. Here is the code:

import re

from docx import Document

# <YOUR_DOC>, <YOUR_PARAGRAPHS> and <MODIFIED_DOC> are placeholders to fill in
document = Document(<YOUR_DOC>)

for paragraph in <YOUR_PARAGRAPHS>:

    run_string = ""
    run_index = {}
    i = 0

    for x, run in enumerate(paragraph.runs):

        # Create a string consisting of all the runs' text. Theoretically this
        # should always be the same as paragraph.text, but I didn't check
        run_string = run_string + run.text

        # The index i represents the starting position of the run in question
        # within the string. We are creating a dictionary of form
        # {<run_start_location>: <pointer_to_run>}
        run_index[i] = x

        # This will be the start of the next run
        i = i + len(run.text)

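    # Search run_string, since the offsets below refer to it. Use a set so
    # that a word matched several times is only processed once
    word_you_wanted_to_find = set(re.findall("some_regex", run_string))

    for word in word_you_wanted_to_find:

        # re.escape makes sure the matched text is searched for literally.
        # [m.start() for m in re.finditer(...)] returns the starting
        # positions of each occurrence of the word
        for word_start in [m.start() for m in re.finditer(re.escape(word), run_string)]: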
            word_end = word_start + len(word)

            # This will be a list of the indices of the runs which have part
            # of the word we want to include
            included_runs = []

            for key in run_index.keys():

                # Remember, the key is the position in the string where the run
                # starts. The run overlaps the word when the word starts before
                # the run ends (key + len(run.text)) and the run starts before
                # the word ends. The first comparison must be strict, otherwise
                # a run that ends exactly where the word starts is included too.
                if word_start < (key + len(paragraph.runs[run_index[key]].text)) and key < word_end:
                    included_runs.append(key)

                # If the key is larger than or equal to the end of the word,
                # this means we have found all relevant keys. We don't need
                # to loop over the rest (we could, it just wouldn't be efficient)
                if key >= word_end:
                    break

            # At this point, included_runs is a full list of indices to the relevant
            # runs so we can modify each one in turn.
            for run_key in included_runs:
                paragraph.runs[run_index[run_key]].italic = True

document.save(<MODIFIED_DOC>)

Problem 1

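The problem with this approach is that, although it is uncommon (at least in my documents), a single run may contain more than just the target word. You may therefore end up italicizing a whole run, which covers the word and then some. For my use case, it wasn't worth solving that problem here.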

Solution

If you want to refine what I did above, you will have to change this block of code:

if word_start < (key + len(paragraph.runs[run_index[key]].text)) and key < word_end:
    included_runs.append(key)

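At this point you have identified a run that contains part of your word. You would need to extend the code to split the word out into a run of its own and remove it from the current run. You can then italicize that new run by itself.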
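One possible way to do that split is sketched below. Both function names are hypothetical. `split_run_texts` is pure string logic; `italicize_span` applies it to a paragraph and relies on python-docx internals (`run._r` and lxml's `addnext`), because as far as I know the library has no public call for inserting a run after another one. It also assumes the affected runs contain only text, since assigning `run.text` replaces the run's content. Treat it as a starting point, not a tested implementation.

```python
import copy


def split_run_texts(run_texts, word_start, word_end):
    """For each run, return its text cut into (piece, is_match) tuples."""
    result = []
    offset = 0
    for text in run_texts:
        run_start, run_end = offset, offset + len(text)
        # Clamp the word boundaries to this run, relative to the run's start
        lo = min(max(word_start, run_start), run_end) - run_start
        hi = min(max(word_end, run_start), run_end) - run_start
        pieces = [(text[:lo], False), (text[lo:hi], True), (text[hi:], False)]
        result.append([(piece, hit) for piece, hit in pieces if piece])
        offset = run_end
    return result


def italicize_span(paragraph, word_start, word_end):
    """Italicize only the characters in [word_start, word_end) of the paragraph."""
    from docx.text.run import Run  # imported lazily; requires python-docx

    runs = list(paragraph.runs)
    plan = split_run_texts([run.text for run in runs], word_start, word_end)
    for run, pieces in zip(runs, plan):
        if not any(hit for _, hit in pieces):
            continue  # this run does not touch the word; leave it alone
        original_italic = run.italic
        current = run
        for n, (text, hit) in enumerate(pieces):
            if n > 0:
                # Assumption: cloning the underlying XML element keeps the
                # run's formatting; the clone is placed right after the run.
                new_r = copy.deepcopy(current._r)
                current._r.addnext(new_r)
                current = Run(new_r, paragraph)
            if len(pieces) > 1:
                current.text = text
            current.italic = True if hit else original_italic
```

For example, with run texts `["say billy", "bob now"]` and the word at offsets 4 to 12, the first run is cut into `"say "` and `"billy"`, the second into `"bob"` and `" now"`, and only `"billy"` and `"bob"` are marked for italics. When a paragraph has several matches, processing them from last to first keeps the earlier offsets valid.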

Problem 2

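The code shown above does not handle both tables and normal text. I didn't need that for my use case, but in the general case you would have to check both.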
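A minimal sketch of one way to cover both, assuming the usual python-docx attributes (`paragraphs` and `tables` on the document and on each cell, `rows` on a table, `cells` on a row). The generator name `iter_all_paragraphs` is hypothetical.

```python
def iter_all_paragraphs(container):
    """Yield the paragraphs of a document (or table cell), descending into tables."""
    # Paragraphs that sit directly in this container
    yield from container.paragraphs
    # Paragraphs inside tables, including tables nested in cells
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from iter_all_paragraphs(cell)
```

The result could be used in place of `<YOUR_PARAGRAPHS>` in the loop above, as `iter_all_paragraphs(document)`. Two caveats: the paragraphs do not come out in document order (body text first, then tables), and merged cells may be visited more than once, which is harmless when the only change is setting italics.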