Question

From the Textract documentation: Documents for synchronous operations can be in PNG or JPEG format. Documents for asynchronous operations can also be in PDF format.

I have a Node.js application that uses asynchronous Textract to read a PDF file. My code is as follows:

import * as AWS from 'aws-sdk';

const textract = new AWS.Textract({ region: '<REGION>' });

export const callTextract = (file: File, uuid: string): Promise<any> => {
  return new Promise<any>((resolve, reject) => {
    const params = {
      Document: {
        Bytes: file,
      },
    };
    textract.detectDocumentText(params, (err, data) => {
      ....
      resolve(data);
    });
  })
}

The file here has already been read from the OS and is a Buffer. From its first 4 bytes (see "Detecting file type from buffer in node js?"), I can confirm that it is a PDF file:

 <Buffer 25 50 44 46 ... >

The error I get is UnsupportedDocumentException.
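The first-four-bytes check mentioned above can be sketched as follows. This is only an illustration: the helper name looksLikePdf is hypothetical, and it accepts a Uint8Array (which a Node.js Buffer also is) so that it does not depend on any library.

```typescript
// The PDF signature "%PDF" as bytes: 25 50 44 46.
const PDF_MAGIC: number[] = [0x25, 0x50, 0x44, 0x46];

// Hypothetical helper: true when the data starts with the PDF signature.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) {
    return false;
  }
  return PDF_MAGIC.every((expected, i) => bytes[i] === expected);
}
```

A check like this only confirms the file type. It does not change which formats detectDocumentText() accepts.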

Answer 1

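detectDocumentText() is synchronous. The asynchronous version is startDocumentTextDetection().

See the docs:

Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format.

...

DetectDocumentText is a synchronous operation. To analyze documents asynchronously, use StartDocumentTextDetection.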

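Note that a language's asynchronous mechanism is not the same thing as an asynchronous API call. With an asynchronous API there will always be at least two calls. In this case, the other one is getDocumentTextDetection().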

...although I think this is yet another example of poor AWS documentation.
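The two-call flow described in this answer might look roughly like the sketch below. It is a minimal illustration, not a definitive implementation. It assumes the PDF has already been uploaded to S3, since the asynchronous operation takes a DocumentLocation that points at an S3 object. TextractLike is a hypothetical interface shaped after the aws-sdk v2 `.promise()` style so the sketch stays self-contained; the bucket, key, and polling interval are placeholders.

```typescript
// Hypothetical minimal shapes for the two Textract calls used here.
interface Block { BlockType?: string; Text?: string }
interface GetResult { JobStatus?: string; Blocks?: Block[]; NextToken?: string }
interface TextractLike {
  startDocumentTextDetection(params: {
    DocumentLocation: { S3Object: { Bucket: string; Name: string } };
  }): { promise(): Promise<{ JobId?: string }> };
  getDocumentTextDetection(params: { JobId: string; NextToken?: string }): {
    promise(): Promise<GetResult>;
  };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Starts an asynchronous text-detection job, polls until it finishes,
// then collects the text of every LINE block across all result pages.
async function detectPdfText(
  client: TextractLike,
  bucket: string,
  key: string,
  pollMs = 5000,
): Promise<string[]> {
  // Call 1: start the job and remember its id.
  const started = await client
    .startDocumentTextDetection({
      DocumentLocation: { S3Object: { Bucket: bucket, Name: key } },
    })
    .promise();
  const jobId = started.JobId;
  if (!jobId) {
    throw new Error('No JobId returned');
  }

  const lines: string[] = [];
  let nextToken: string | undefined;
  for (;;) {
    // Call 2 (repeated): fetch the job status and, once done, its results.
    const params = nextToken ? { JobId: jobId, NextToken: nextToken } : { JobId: jobId };
    const page = await client.getDocumentTextDetection(params).promise();
    if (page.JobStatus === 'IN_PROGRESS') {
      await sleep(pollMs);
      continue;
    }
    if (page.JobStatus !== 'SUCCEEDED') {
      throw new Error(`Job ended with status ${page.JobStatus}`);
    }
    for (const block of page.Blocks ?? []) {
      if (block.BlockType === 'LINE' && block.Text) {
        lines.push(block.Text);
      }
    }
    nextToken = page.NextToken;
    if (!nextToken) {
      return lines;
    }
  }
}
```

With the real SDK, something like `detectPdfText(new AWS.Textract({ region: '<REGION>' }), '<BUCKET>', '<KEY>')` would be the intended usage, assuming the client is structurally compatible with the interface above. Polling is the simplest option to show here; the API also supports a notification channel for job completion, which this sketch leaves out.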

Answer 2

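You can supply a Bytes field in both the synchronous and the asynchronous API, but the Bytes field is defined identically in both:

A blob of base64-encoded document bytes. The maximum size of a document that's provided in a blob of bytes is 5 MB. The document bytes must be in PNG or JPEG format.

So you cannot upload PDF content as the value of the Bytes field.

From the documentation: https://docs.aws.amazon.com/textract/latest/dg/API_Document.html#API_Document_Contents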
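The constraints quoted in this answer can be checked before making the call. The following is a rough sketch under stated assumptions: bytesFieldProblem is a hypothetical helper, "5 MB" is read here as 5 * 1024 * 1024 bytes, and the format test only looks at the leading signature bytes of PNG and JPEG files.

```typescript
// Assumption: the documented "5 MB" limit is treated as 5 MiB.
const MAX_BYTES = 5 * 1024 * 1024;

// Leading signature bytes of the two accepted formats.
const PNG_MAGIC: number[] = [0x89, 0x50, 0x4e, 0x47];
const JPEG_MAGIC: number[] = [0xff, 0xd8, 0xff];

function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Hypothetical pre-flight check: returns null when the data looks acceptable
// for the Bytes field, otherwise a short reason why it is not.
function bytesFieldProblem(bytes: Uint8Array): string | null {
  if (bytes.length > MAX_BYTES) {
    return 'larger than 5 MB';
  }
  if (!startsWith(bytes, PNG_MAGIC) && !startsWith(bytes, JPEG_MAGIC)) {
    return 'not PNG or JPEG';
  }
  return null;
}
```

For the Buffer in the question, which starts with 25 50 44 46, this check would report "not PNG or JPEG", which matches the UnsupportedDocumentException the service returns.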
