Question

I have a .docx file that contains numbered bullets. For example:

1. Main Topic
1.1 Sub Topic 
     Facts on Sub topic
1.2 Sub Topic 1
     Facts on Sub Topic 2
2. Another main topic
2.1 random text
2.2 random text1

My code:

import docx2txt

path = "my_file.docx"

# Extract the document's plain text (list numbering is not included)
text = docx2txt.process(path)

This is the value of text that I get:

Main Topic
Sub Topic 
     Facts on Sub topic
Sub Topic 1
     Facts on Sub Topic 2
Another main topic
random text
random text1

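Problem:

The output is otherwise correct; I just need the numbered bullets to be included in the output as well.

What am I missing here to get the desired output?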

Answer 1

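Have you tried using python-docx and python-pptx? In most rich-text editors (such as Word), bullets are not actually part of the text content, so they are a bit tricky to extract. However, in python-pptx you can access Paragraph.text for the plain-text string, or Paragraph.style for the list bullet style.

I haven't looked into it thoroughly, but there is plenty of documentation on Paragraph here: https://python-pptx.readthedocs.io/en/latest/user/text.html

Also, this appears to be a requested feature: https://github.com/scanny/python-pptx/issues/100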
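
For a .docx file, the equivalent attributes live in python-docx, where each paragraph exposes .text and .style. The following is a minimal sketch, not a complete solution: the helper names describe_paragraphs and describe_docx are hypothetical, and the approach assumes the list was built with named list styles such as "List Number". It reports which paragraphs are list items, but it does not recover the rendered number itself.

```python
def describe_paragraphs(paragraphs):
    """Return (style name, text) pairs for objects exposing .style.name and .text."""
    return [(p.style.name, p.text) for p in paragraphs]


def describe_docx(path):
    """Load a .docx file with python-docx and describe its body paragraphs."""
    # Imported here so describe_paragraphs stays usable without python-docx installed.
    from docx import Document

    return describe_paragraphs(Document(path).paragraphs)
```

Calling describe_docx("my_file.docx") would give pairs such as ("List Number", "Main Topic") when list styles were used. Documents numbered through the toolbar often report a generic style instead, in which case this sketch will not distinguish list items.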

Answer 2

How particular are you about the numbering? docx2python will return this list:

1) Main Topic
    1) Sub Topic 
Facts on subtopic
    2) Sub Topic 1
Facts on Sub Topic 2
2) Another main topic
    1) random text
    2) random text1

This is not your exact input, but you could easily convert it back to what you want. The indentation and the number values are all there.

If you just want to see the text above:

from docx2python import docx2python

print(docx2python('document.docx').text)

The numbered lists will be indented with tabs. You could count the tabs and write a small parser, starting with

from docx2python import docx2python
from docx2python.iterators import iter_paragraphs

content = docx2python('document.docx')
paragraphs = list(iter_paragraphs(content.document))

This puts all header, footer, body, footnote, and endnote text into one list. You can select any one of those parts by using

content.header
content.footer
content.body
content.footnotes
content.endnotes

in place of content.document.
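
The tab-counting parser mentioned above could be sketched roughly as follows. This is only an illustration under an assumed line shape: zero or more leading tabs for the depth, then "N)" and the item text, as in the sample output shown earlier. The function name renumber is hypothetical and not part of docx2python.

```python
import re

# Assumed line shape: leading tabs give the depth, then "N)" and the item text.
ITEM = re.compile(r"^(\t*)(\d+)\)\s*(.*)$")


def renumber(lines):
    """Rewrite tab-indented 'N)' items as hierarchical numbers (1., 1.1, ...)."""
    counters = []  # counters[d] holds the current number at depth d
    out = []
    for line in lines:
        match = ITEM.match(line)
        if match is None:
            out.append(line)  # not a list item: pass through unchanged
            continue
        tabs, number, text = match.groups()
        depth = len(tabs)
        # Forget counters deeper than this item; pad with 0 if a level was skipped.
        del counters[depth + 1:]
        while len(counters) <= depth:
            counters.append(0)
        counters[depth] = int(number)
        label = ".".join(str(n) for n in counters)
        if depth == 0:
            label += "."  # top-level items are written "1.", "2.", ...
        out.append(f"{label} {text}")
    return out
```

It could be fed the paragraphs list from above, or the lines of the .text string. Lines that do not look like list items, such as the "Facts on ..." paragraphs, are left untouched.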

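One problem with exporting docx, html, pdf, and the like to plain text is that plain text cannot indent a paragraph. You can indent the first line of a paragraph with spaces or a tab, but the rest of the paragraph will not be indented.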

        My plain-text, tab-indented paragraph will only be indented before
the line wrap, after that...

You have to either 1) keep the original format or 2) accept some serious compromises.
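
One possible compromise is to hard-wrap each paragraph at a fixed width and repeat the indentation on every line, using the standard-library textwrap module. This is a sketch only: the function name indent_paragraph, the four-space unit, and the 72-column width are arbitrary choices for illustration, and the result no longer reflows if the reader's window is narrower.

```python
import textwrap


def indent_paragraph(text, depth=1, width=72, unit="    "):
    """Hard-wrap text so that every output line carries the same indentation."""
    prefix = unit * depth
    return textwrap.fill(
        text,
        width=width,
        initial_indent=prefix,     # indentation for the first line
        subsequent_indent=prefix,  # same indentation after each line wrap
    )
```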

Best of luck with your project.

Converting .docx to text: numbered bullets are removed
