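My PowerPoint slides consist of text boxes, some of which are contained in group shapes. When I extract data from them, the text does not come out in order. Sometimes a text box at the end of the slide is extracted first, sometimes one from the middle, and so on.
The following code gets the text from the text boxes and also handles group objects.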
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

# `files` is assumed to be a list of .pptx paths defined elsewhere
text_list = []
for eachfile in files:
    prs = Presentation(eachfile)
    textrun = []
    for slide in prs.slides:
        # ---Only on text-boxes outside group elements---
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)
        # ---Only operate on group shapes---
        group_shapes = [
            shp for shp in slide.shapes
            if shp.shape_type == MSO_SHAPE_TYPE.GROUP
        ]
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    print(shape.text)
                    textrun.append(shape.text)
    # ---one joined string per presentation---
    new_list = " ".join(textrun)
    text_list.append(new_list)
print(text_list)
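I want to filter some of the extracted data based on the order in which it appears on the slide. On what basis does the function decide the order? What should I do to solve this?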
Answer 0 (score: 1)
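Steve's comment is quite right; the shapes returned from:
for shape in slide.shapes:
    ...
are in document order of the underlying XML, which is also what establishes the z-order. Z-order is the "stacking" order, as if each shape were on its own transparent sheet (layer): the first shape returned is at the bottom, and each subsequent shape is added to the top of the stack (overlapping any shapes beneath it).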
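To make the difference between stacking order and visual position concrete, here is a minimal sketch. It does not use python-pptx at all: `Shape` is a hypothetical stand-in that carries only a name and the `top`/`left` values, and the numbers are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a python-pptx shape (assumption: only these
# three attributes matter for the illustration).
Shape = namedtuple("Shape", ["name", "top", "left"])

# Listed in document order, i.e. the order `slide.shapes` would yield them:
# first item is at the bottom of the stack, last item is on top.
shapes = [
    Shape("Footer", top=6000000, left=500000),
    Shape("Title", top=300000, left=500000),
    Shape("Body", top=1500000, left=500000),
]

# The z-order index is just the position in the iteration.
z_order = [(z, s.name) for z, s in enumerate(shapes)]

# The visual order has to be derived from the coordinates instead.
reading_order = [s.name for s in sorted(shapes, key=lambda s: (s.top, s.left))]

print(z_order)
print(reading_order)
```

In this made-up slide the footer was drawn first, so it comes out first in iteration even though it sits at the bottom of the page.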
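I think what you're after is something like left-to-right, top-to-bottom order. You'll need to write your own code to sort the shapes into that order, using shape.left and shape.top.
Something like this might do the trick: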
from pptx.enum.shapes import MSO_SHAPE_TYPE

def iter_textframed_shapes(shapes):
    """Generate shape objects in *shapes* that can contain text.

    Shape objects are generated in document order (z-order), bottom to top.
    """
    for shape in shapes:
        # ---recurse on group shapes---
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            yield from iter_textframed_shapes(shape.shapes)
            continue
        # ---otherwise, treat shape as a "leaf" shape---
        if shape.has_text_frame:
            yield shape

# ---`slide` is one slide from `prs.slides`---
textable_shapes = list(iter_textframed_shapes(slide.shapes))
# ---sort top-to-bottom, then left-to-right---
ordered_textable_shapes = sorted(
    textable_shapes, key=lambda shape: (shape.top, shape.left)
)
for shape in ordered_textable_shapes:
    print(shape.text)
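One possible refinement, offered as a sketch rather than as part of the answer above: a strict `(top, left)` sort puts a box that sits a hair higher before everything else in its row, even if it is the rightmost box. Grouping shapes into rows with a tolerance avoids that. `Box`, `sort_reading_order`, and the quarter-inch default are all assumptions made for illustration; `Box` is a stand-in for any object with `top` and `left`, such as a python-pptx shape.

```python
from dataclasses import dataclass

EMU_PER_INCH = 914400  # python-pptx lengths are integers in English Metric Units


@dataclass
class Box:
    """Hypothetical stand-in exposing the attributes the sort needs."""
    text: str
    top: int
    left: int


def sort_reading_order(shapes, row_tolerance=EMU_PER_INCH // 4):
    """Return shapes top-to-bottom, then left-to-right within each row.

    A shape joins the current row when its `top` is within `row_tolerance`
    of the first (highest) shape in that row; otherwise it starts a new row.
    """
    rows = []
    for shape in sorted(shapes, key=lambda s: s.top):
        if rows and shape.top - rows[-1][0].top <= row_tolerance:
            rows[-1].append(shape)
        else:
            rows.append([shape])
    return [s for row in rows for s in sorted(row, key=lambda s: s.left)]


boxes = [
    Box("right", top=1000500, left=5000000),  # marginally higher than "left"
    Box("left", top=1002000, left=500000),
    Box("below", top=3000000, left=500000),
]

strict = [b.text for b in sorted(boxes, key=lambda b: (b.top, b.left))]
tolerant = [b.text for b in sort_reading_order(boxes)]
print(strict)
print(tolerant)
```

The right tolerance depends on the deck's layout, so treat the default as a starting point to tune.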