Question

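I am trying to find a way to look in a folder and search the contents of all PowerPoint documents in that folder for specific strings, preferably using Python. When one of those strings is found, I want to report the text that follows it as well as the document it was found in. I would like to compile this information and report it as a CSV file.

So far I have only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. It returns all of the text contained in a given document, which is not what I want. Please help.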

Answer 1

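python-pptx can be used to do what you propose. At a high level, you would do something like this (a sketch of the overall approach, not a complete solution):

import glob
import os

from pptx import Presentation

directory = "."  # folder to search

for pptx_filename in glob.glob(os.path.join(directory, "*.pptx")):
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            # Pictures and some other shapes have no text frame.
            if shape.has_text_frame:
                print(shape.text_frame.text)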

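You will need to add the parts that search the shape text for your key strings and write the results to a CSV file or wherever, but this general approach should work fine. I'll leave it to you to work out the finer points :)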
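The missing pieces, matching the key string and writing the CSV, could be sketched roughly as follows. This is only one possible reading of the question: it assumes "the text that follows the string" means the rest of the same line, and the function names (`text_after`, `search_folder`, `write_report`) and the CSV column names are made up for illustration.

```python
import csv
import glob
import os


def text_after(text, keyword):
    """Return the remainder of every line in text that contains keyword."""
    results = []
    for line in text.splitlines():
        _, found, rest = line.partition(keyword)
        if found:
            results.append(rest.strip())
    return results


def search_folder(folder, keyword):
    """Collect (file, slide number, keyword, following text) rows for a folder."""
    from pptx import Presentation  # third-party: pip install python-pptx

    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.pptx"))):
        prs = Presentation(path)
        for slide_number, slide in enumerate(prs.slides, start=1):
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue
                for match in text_after(shape.text_frame.text, keyword):
                    rows.append((os.path.basename(path), slide_number, keyword, match))
    return rows


def write_report(rows, csv_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "slide", "keyword", "text_after"])
        writer.writerows(rows)
```

Something like `write_report(search_folder("slides", "Owner:"), "report.csv")` would then produce the report, where the folder name, key string, and output path are placeholders.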

Answer 2

This actually works:

from pptx import Presentation
import os

# All .pptx files in the current working directory
files = [x for x in os.listdir() if x.endswith(".pptx")]

for eachfile in files:
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            # Only shapes that expose text (pictures, for example, do not)
            if hasattr(shape, "text"):
                print(shape.text)
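One limitation of looping over `slide.shapes` directly is that it only visits top-level shapes, so text inside tables and grouped shapes can be missed. A possible way to cover those cases is sketched below. The helper name `iter_shape_text` is hypothetical, and the sketch relies on duck typing (`has_text_frame`, `has_table`, and a `shapes` attribute on group shapes) as I understand python-pptx to expose them, so treat it as a starting point rather than a definitive implementation.

```python
def iter_shape_text(shapes):
    """Yield text from shapes, descending into tables and grouped shapes."""
    for shape in shapes:
        if getattr(shape, "has_text_frame", False):
            yield shape.text_frame.text
        elif getattr(shape, "has_table", False):
            # Table cells each carry their own text.
            for row in shape.table.rows:
                for cell in row.cells:
                    yield cell.text
        elif hasattr(shape, "shapes"):
            # Group shapes contain their own collection of shapes.
            yield from iter_shape_text(shape.shapes)
```

Inside the loop over slides, `for text in iter_shape_text(slide.shapes): print(text)` would replace the inner `for shape in slide.shapes` loop.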

Answer 3

tika-python

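A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: it also works with pyinstaller.

Install with pip:

pip install tika

Example: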

#!/usr/bin/env python
from tika import parser

parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # metadata of the file
print(parsed["content"])   # text content of the file
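To apply the same idea to every PowerPoint file in a folder, a loop along these lines might work. It is a sketch under assumptions: `extract_text` and `extract_folder` are hypothetical helper names, the `parse` parameter exists only so the parser can be swapped out, and tika-python needs a Java runtime available because it runs the Tika server behind the scenes. The `"content"` entry can be `None` when no text is found, so the sketch falls back to an empty string.

```python
import glob
import os


def extract_text(path, parse=None):
    """Return the plain text Tika extracts from path, or "" if there is none."""
    if parse is None:
        from tika import parser  # third-party: pip install tika (needs Java)

        parse = parser.from_file
    parsed = parse(path)
    # "content" may be None when no text could be extracted.
    return (parsed.get("content") or "").strip()


def extract_folder(folder, parse=None):
    """Map each .pptx file name in folder to its extracted text."""
    texts = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pptx"))):
        texts[os.path.basename(path)] = extract_text(path, parse)
    return texts
```

Unlike the python-pptx approach, this returns each file's text as one string, so slide numbers are not available for the report.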

Link to the official GitHub repository

Extract text from multiple PowerPoint files using Python
