Converting PDFs to .txt files using Google Cloud Storage

Time: 2018-07-26 21:55:29

Tags: pdf google-cloud-storage

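I have this Python code, which works on the local file system.

What is the equivalent of os.getcwd() and os.listdir() in the Python object API?

I would like this code to work with files from GCS.

To work with the GCS folders, I came up with this code: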

from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
pdfDir = bucket.get_blob('uploads/pdf/')
txtDir = bucket.get_blob('uploads/txt/')

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt

# Converts a PDF and returns its text content as a string
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

# Converts all PDFs in directory pdfDir and saves the resulting txt files to txtDir
def PDF2txt(pdfDir, txtDir):
    if pdfDir == "": pdfDir = os.getcwd() + "\\"  # if no pdfDir passed in
    for pdf in os.listdir(pdfDir):  # iterate through PDFs in the pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = pdfDir + pdf
            text = convert(pdfFilename)  # get string of text content of pdf
            textFilename = txtDir + pdf + ".txt"
            textFile = open(textFilename, "w")  # make text file
            textFile.write(text)  # write text to text file
            textFile.close()  # flush and release the file handle

pdfDir = "C:/pdftotxt/pdfs/"
txtDir = "C:/pdftotxt/txt/"
PDF2txt(pdfDir, txtDir)

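1 Answer:

Answer 0 (score: 0)

I assume you want to list the objects in the bucket, as well as the objects in a specific folder within the bucket. For that you can use the Python client library provided by Google Cloud Storage directly. Use bucket.list_blobs() to list the whole bucket, and bucket.list_blobs(prefix=prefix, delimiter=delimiter) to list a specific folder or object.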
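As a rough sketch of the listing step: GCS has a flat namespace, so a "folder" such as uploads/pdf/ is just a name prefix, and bucket.get_blob('uploads/pdf/') returns a single object (or None) rather than a directory. The helper names `open_bucket` and `list_pdf_names` below are hypothetical; the bucket name and prefix are taken from the question. The client library is imported lazily so the listing helper can be tried against any object that offers a `list_blobs` method.

```python
def open_bucket(bucket_name):
    """Return a bucket handle (requires the google-cloud-storage package)."""
    # Imported here so the helper below does not need the library to be loaded.
    from google.cloud import storage
    return storage.Client().bucket(bucket_name)


def list_pdf_names(bucket, prefix="uploads/pdf/", delimiter=None):
    """Rough GCS counterpart of os.listdir(): names of .pdf objects under a prefix.

    Pass delimiter="/" to skip objects nested in deeper "subfolders".
    """
    blobs = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    return [blob.name for blob in blobs if blob.name.lower().endswith(".pdf")]
```

With credentials configured, `list_pdf_names(open_bucket('my-bucket'))` should return the full object names (prefix included) of the PDFs, which is what the loop in PDF2txt would iterate over instead of os.listdir(pdfDir).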

More detailed documentation can be found at [1], and the Git repository containing the whole library is at [2].
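Building on list_blobs, one possible way to adapt the question's PDF2txt to GCS is to download each PDF into memory, extract the text, and upload the result under the txt prefix. This is only a sketch under assumptions not stated in the thread: Python 3 with pdfminer.six (so io.StringIO replaces cStringIO), and a google-cloud-storage version that provides `download_as_bytes` (older releases expose `download_as_string` instead). The names `pdf_bytes_to_text` and `convert_bucket_pdfs` are hypothetical.

```python
import io


def pdf_bytes_to_text(data):
    """Extract text from PDF bytes using the same pdfminer classes as the question."""
    # Lazy imports; assumes pdfminer.six is installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    output = io.StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # An in-memory buffer stands in for the local file opened in convert().
    for page in PDFPage.get_pages(io.BytesIO(data)):
        interpreter.process_page(page)
    converter.close()
    return output.getvalue()


def convert_bucket_pdfs(bucket, pdf_prefix, txt_prefix, to_text=pdf_bytes_to_text):
    """Convert each .pdf object under pdf_prefix and upload the text under txt_prefix.

    Returns the names of the text objects that were written.
    """
    written = []
    for blob in bucket.list_blobs(prefix=pdf_prefix):
        if not blob.name.lower().endswith(".pdf"):
            continue  # skip folder placeholders and non-PDF objects
        text = to_text(blob.download_as_bytes())
        # Mirror the question's naming: "x.pdf" becomes "x.pdf.txt".
        target = txt_prefix + blob.name[len(pdf_prefix):] + ".txt"
        bucket.blob(target).upload_from_string(text, content_type="text/plain")
        written.append(target)
    return written
```

Given a bucket handle such as storage.Client().bucket('my-bucket'), a call like `convert_bucket_pdfs(bucket, 'uploads/pdf/', 'uploads/txt/')` would play the role of PDF2txt(pdfDir, txtDir). Keeping the extractor as a parameter makes it easy to swap in a different PDF library or to test the bucket logic without real PDFs.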