Question

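I want to use the Textract OCR service to read text from a PDF file. My problem is that I want to do this locally, without an S3 bucket. I tested it with image files and it worked fine, but it does not work with PDF files.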

Here is the code that gives me an error:

response = textract.start_document_text_detection(DocumentLocation="sample2.pdf")

Error:

Invalid type for parameter DocumentLocation, value: sample2.pdf, type: <class 'str'>, valid types: <class 'dict'>

Code 2:

response = textract.start_document_text_detection(DocumentLocation={"name":"sample2.pdf"})

Error:

Unknown parameter in DocumentLocation: "name", must be one of: S3Object

Code 3:

response = textract.start_document_text_detection(Document={'Bytes': "sample2.pdf"})

Error:

Unknown parameter in input: "Document", must be one of: DocumentLocation, ClientRequestToken, JobTag, NotificationChannel, OutputConfig

What should I do? Is there a way to make Textract work on PDF documents without using S3?

Answer 1

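The short answer to your question is "no".

Essentially, the service requires structured input that you have to fill in according to its specification. This is the DocumentLocation dictionary that boto3 expects for start_document_text_detection: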

DocumentLocation={
    'S3Object': {
        'Bucket': 'string',
        'Name': 'string',
        'Version': 'string'
    }
}
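Since start_document_text_detection only accepts an S3Object, one workaround is to upload the local PDF to a bucket first and then start the job. The following is a minimal sketch of that flow, not a definitive implementation. The helper names (build_document_location, extract_pdf_lines) and the bucket and key values are hypothetical, and the polling loop is a simplification that treats any final status other than SUCCEEDED as a failure.

```python
import time


def build_document_location(bucket, key):
    """Return the DocumentLocation dict in the shape boto3 expects."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def extract_pdf_lines(s3, textract, local_path, bucket, key, poll_seconds=5):
    """Upload a local PDF to S3, run the async job, and return detected lines.

    `s3` and `textract` are boto3 clients; bucket and key are assumptions
    that you replace with your own values.
    """
    # The asynchronous API reads the document from S3, so upload it first.
    s3.upload_file(local_path, bucket, key)

    job = textract.start_document_text_detection(
        DocumentLocation=build_document_location(bucket, key)
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {result['JobStatus']}")

    # Results may be paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in result["Blocks"]
            if block["BlockType"] == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            break
        result = textract.get_document_text_detection(
            JobId=job_id, NextToken=token
        )
    return lines
```

With real clients this would be called as extract_pdf_lines(boto3.client("s3"), boto3.client("textract"), "sample2.pdf", "my-bucket", "sample2.pdf"), where "my-bucket" stands in for a bucket you own.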

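I am currently running into a similar problem with boto3, but I will keep working through the documentation to find out what can be done about it.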