Question

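My PowerPoint slides consist of text boxes, sometimes contained in group shapes. When I extract data from them, the text does not come out in order. Sometimes a text box at the end of the presentation is extracted first, sometimes one from the middle, and so on.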

The following code gets the text from the text boxes and also handles group objects.

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

text_list = []  # one joined string per presentation
for eachfile in files:  # `files` is assumed to be a list of .pptx paths defined elsewhere
    prs = Presentation(eachfile)
    textrun = []
    for slide in prs.slides:
        # ---Only text boxes outside group elements---
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)

        # ---Only operate on group shapes---
        group_shapes = [shp for shp in slide.shapes
                        if shp.shape_type == MSO_SHAPE_TYPE.GROUP]
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    print(shape.text)
                    textrun.append(shape.text)
    new_list = " ".join(textrun)
    text_list.append(new_list)

print(text_list)

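I want to filter some of the extracted data based on the order in which it appears on the slide. What does the function use to decide the order, and what should I do to solve this?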

Answer 1

Steve's comment is quite right; the shapes returned by:

for shape in slide.shapes:
    ...

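are in the document order of the underlying XML, which is also what establishes the z-order. Z-order is the "stacking" order, as though each shape were on its own transparent sheet (layer), with the first shape returned at the bottom and each subsequent shape added to the top of the stack (overlapping any beneath it).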
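To see why this makes the extraction order look arbitrary, here is a minimal sketch that models a slide's shapes as a plain list in z-order. `Box` and `bring_to_front` are hypothetical stand-ins for illustration only, not python-pptx classes or functions, and the positions are made-up numbers.

```python
from collections import namedtuple

# Hypothetical stand-in for a slide shape; not a python-pptx class.
Box = namedtuple("Box", "text top left")

# Document (z-) order: the title was drawn last, so it comes last in
# the list even though it sits at the top of the slide.
z_order = [
    Box("footer", top=600, left=50),
    Box("body", top=200, left=50),
    Box("title", top=20, left=50),
]


def bring_to_front(shapes, index):
    """Model "Bring to Front": move one shape to the end of the list."""
    reordered = list(shapes)
    reordered.append(reordered.pop(index))
    return reordered


# Iterating follows list order, which ignores position on the slide.
iteration_order = [box.text for box in z_order]

# Restacking changes iteration order without moving anything visually.
after_restack = [box.text for box in bring_to_front(z_order, 0)]

# Sorting by position recovers the visual top-to-bottom order.
visual_order = [
    box.text for box in sorted(z_order, key=lambda b: (b.top, b.left))
]
```

In this model the iteration order depends only on when each shape was added or restacked, which matches the "sometimes the last box comes first" symptom in the question.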

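I think what you want is something like left-to-right, top-to-bottom order. You'll need to write your own code to sort the shapes into that order, using shape.left and shape.top.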

Something like this might do the trick:

from pptx.enum.shapes import MSO_SHAPE_TYPE


def iter_textframed_shapes(shapes):
    """Generate shape objects in *shapes* that can contain text.

    Shape objects are generated in document order (z-order), bottom to top.
    """
    for shape in shapes:
        # ---recurse on group shapes---
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            group_shape = shape
            for shape in iter_textframed_shapes(group_shape.shapes):
                yield shape
            continue

        # ---otherwise, treat shape as a "leaf" shape---
        if shape.has_text_frame:
            yield shape

textable_shapes = list(iter_textframed_shapes(slide.shapes))
ordered_textable_shapes = sorted(
    textable_shapes, key=lambda shape: (shape.top, shape.left)
)

for shape in ordered_textable_shapes:
    print(shape.text)
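One caveat with sorting on (top, left) alone: two boxes that look like they are on the same row can differ slightly in `top`, and the higher one will then always sort first, even if it is on the right. A possible refinement, sketched below, is to group shapes into rows using a tolerance before sorting each row by `left`. The `reading_order` function and its `row_tolerance` parameter are hypothetical names introduced here, and `Box` is a stand-in record rather than a python-pptx shape; the function only relies on `.top` and `.left` attributes.

```python
from collections import namedtuple

# Hypothetical stand-in for a shape; only .top and .left are used.
Box = namedtuple("Box", "text top left")


def reading_order(shapes, row_tolerance):
    """Return shapes top-to-bottom, then left-to-right within each row.

    A shape joins the current row when its `top` is within
    `row_tolerance` of the first shape placed in that row.
    """
    rows = []
    for shape in sorted(shapes, key=lambda s: s.top):
        if rows and shape.top - rows[-1][0].top <= row_tolerance:
            rows[-1].append(shape)
        else:
            rows.append([shape])
    return [s for row in rows for s in sorted(row, key=lambda s: s.left)]


boxes = [
    Box("right", top=100, left=500),
    Box("left", top=105, left=50),
    Box("below", top=300, left=50),
]

# Plain (top, left) sort puts "right" first because it is 5 units higher.
naive = [b.text for b in sorted(boxes, key=lambda b: (b.top, b.left))]

# Row-tolerant sort treats "left" and "right" as one row.
banded = [b.text for b in reading_order(boxes, row_tolerance=20)]
```

With real python-pptx shapes, `row_tolerance` would be in the same units as `shape.top` (EMU), so a value such as `Pt(10)` from `pptx.util` might be a starting point; that figure is a guess to tune per deck.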
