Question

I am trying to retrieve a .doc file from an S3 bucket and read its text with textract. To do that, I wrote the following two functions:

def process_files(filepath):
   s3 = s3fs.S3FileSystem()
   filename = 's3://' + bucket_name + '/' + filepath
   _, ext = os.path.splitext(filename)
   if ext == '.pdf':
       extract_string = pdf_to_string(s3, filename)
       return extract_string
   elif ext == '.doc':
       extract_string = doc_to_string(s3, filename)
       return extract_string

def doc_to_string(s3_file, filename):
   """
   Convert a .doc or .docx file into a string.
   """
   print(filename)
   print(s3_file.ls('/myname/test_files/*'))
   text = textract.process(filename)

   return text

However, I got this error:

Is this the right path/to/file/you/want/to/extract.doc

So I changed the code to pass a different path:

def doc_to_string(s3_file, filename):
   """
   Convert a .doc or .docx file into a string.
   """
   text = textract.process(s3_file.ls('/myname/test_files/*'))
   return text

But then I got:

path should be string, bytes or os.PathLike
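
Both errors look consistent with textract.process expecting a single path on the local filesystem: an 's3://...' string is not a local file, and s3_file.ls(...) returns a list rather than one path. One possible workaround, sketched below under that assumption, is to copy the object to a temporary local file first and run the extractor on that. The helper name extract_text_from_s3 and the bucket path in the usage comment are hypothetical and not part of the original code.

```python
import os
import shutil
import tempfile


def extract_text_from_s3(fs, s3_path, extract):
    """Copy s3_path to a local temp file, then call extract on the local path.

    fs      -- any object with open(path, "rb"), e.g. s3fs.S3FileSystem()
    extract -- callable that takes a local file path, e.g. textract.process
    """
    _, ext = os.path.splitext(s3_path)
    # Keep the original extension: textract chooses its parser from it.
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        with fs.open(s3_path, "rb") as remote:
            shutil.copyfileobj(remote, tmp)
        local_path = tmp.name
    try:
        return extract(local_path)
    finally:
        # Remove the temporary copy whether or not extraction succeeded.
        os.remove(local_path)


# Hypothetical usage (bucket and key are placeholders):
# import s3fs, textract
# text = extract_text_from_s3(
#     s3fs.S3FileSystem(), "s3://my-bucket/test_files/sample.doc", textract.process
# )
```

Passing the filesystem and the extractor in as arguments keeps the helper independent of S3 and textract, so it can be tried against a local stand-in first. Note that textract.process returns bytes, and that parsing .doc files typically relies on an external tool such as antiword being installed.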

Extracting text with textract from an S3 bucket

0 answers: