I am trying to retrieve a .doc file from an S3 bucket and read its text with textract. To do that, I wrote the following two functions:
import os

import s3fs
import textract

# bucket_name and pdf_to_string are defined elsewhere in my code.

def process_files(filepath):
    s3 = s3fs.S3FileSystem()
    filename = 's3://' + bucket_name + '/' + filepath
    _, ext = os.path.splitext(filename)
    if ext == '.pdf':
        extract_string = pdf_to_string(s3, filename)
        return extract_string
    elif ext == '.doc':
        extract_string = doc_to_string(s3, filename)
        return extract_string

def doc_to_string(s3_file, filename):
    """
    Convert a .doc or .docx file into a string.
    """
    print(filename)
    print(s3_file.ls('/myname/test_files/*'))
    # filename is an 's3://...' URL here, not a local path
    text = textract.process(filename)
    return text
However, I get the error:
Is this the right path/to/file/you/want/to/extract.doc
So I changed the code to pass a different path:
def doc_to_string(s3_file, filename):
    """
    Convert a .doc or .docx file into a string.
    """
    # s3_file.ls(...) returns a list of paths, not a single path
    text = textract.process(s3_file.ls('/myname/test_files/*'))
    return text
But then I get:
expected str, bytes or os.PathLike object
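
Both errors seem to come from the same place: textract.process expects the path of a file on the local disk, so it can neither open an 's3://...' URL nor accept the list returned by ls(). One possible workaround, shown here only as a sketch, is to copy the object to a temporary local file and hand that path to textract. The helper below is hypothetical: the `extract` parameter is my own addition so the function can be exercised without textract installed, and `fs` is assumed to be an s3fs.S3FileSystem (or anything else with an open(path, "rb") method).

```python
import os
import shutil
import tempfile


def doc_to_string(fs, s3_path, extract=None):
    """Copy one remote file to a local temp file and extract its text.

    fs      -- object with open(path, "rb"), e.g. s3fs.S3FileSystem() (assumed)
    extract -- callable taking a local path; defaults to textract.process
    """
    if extract is None:
        import textract  # imported lazily so the helper works without it in tests
        extract = textract.process

    # Keep the extension: textract chooses its parser from the file suffix.
    _, ext = os.path.splitext(s3_path)
    tmp = tempfile.NamedTemporaryFile(suffix=ext, delete=False)
    try:
        # Stream the remote bytes into the temp file, then close both handles.
        with tmp, fs.open(s3_path, "rb") as remote:
            shutil.copyfileobj(remote, tmp)
        # The file is closed at this point, so the extractor can reopen it.
        return extract(tmp.name)
    finally:
        os.remove(tmp.name)
```

In process_files this would be called as doc_to_string(s3, filename), passing a single path string. If several files under a prefix need processing, the list from ls() would have to be looped over and each entry passed in separately.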