Question

I have started experimenting with AWS Textract, specifically detect-document-text (docs: https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html). For example, take an image with the following content:

This is the first line
should continue here.

This is the second line.

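The detect-document-text output is a JSON document in which each node has a BlockType of WORD, LINE, or PAGE (other elements are attached as well, such as Relationships, which defines a type and a list of Ids, plus Geometry information (coordinates), Confidence, and so on). In this case the output will, as expected, contain one LINE block per line, as follows: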

{
...
  {
    ...
    "BlockType": "LINE",
    "Confidence": 97.8960189819336,
    "Text": "This is the first line",
    ...
  },
  {
    ...
    "BlockType": "LINE",
    "Confidence": 97.8960189819336,
    "Text": "should continue here.",
    ...
  },
  {
    ...
    "BlockType": "LINE",
    "Confidence": 97.8960189819336,
    "Text": "This is the second line.",
    ...
  },
  ...
}

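My question is this: is there a parameter that can be overridden (for example, a span value for the line or cell, so that a single node is kept per "sentence"), or an option to group lines by paragraph (based on the computed coordinates), so that I get complete sentences? Or is this mandatory post-processing on the client side? I am asking because this seems like a common scenario, so I am trying to find out whether Textract, or another AWS service that consumes the Textract output JSON, already provides it.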

Answer 1

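Looking at the Textract DetectDocumentText API, the request syntax accepts only a Document parameter, whose content is supplied either as raw Bytes or as an S3Object reference: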

{
   "Document": {
      "Bytes": blob,
      "S3Object": {
         "Bucket": "string",
         "Name": "string",
         "Version": "string"
      }
   }
}

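In other words, there is no additional parameter you can pass to the API to make the JSON output group lines by paragraph.

If you want the output processed so that lines are grouped by paragraph, you will need to build your own logic.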
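As a starting point for that logic, here is a minimal Python sketch that pulls the LINE blocks out of a response and keeps the bounding-box values needed for grouping. It assumes the response is already available as a dict (for example, as returned by boto3's `detect_document_text`). The helper name `extract_lines` and the flattened record layout are illustrative choices, not part of the Textract API.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one flat record per LINE block, sorted top-to-bottom, then left-to-right."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        # BoundingBox values are ratios of the page width/height (0.0 to 1.0).
        box = block["Geometry"]["BoundingBox"]
        lines.append({
            "Text": block["Text"],
            "Left": box["Left"],
            "Top": box["Top"],
            "Width": box["Width"],
            "Height": box["Height"],
        })
    return sorted(lines, key=lambda line: (line["Top"], line["Left"]))


# Hand-written sample shaped like a Textract response (values are made up).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {
            "BlockType": "LINE",
            "Text": "should continue here.",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.14, "Width": 0.35, "Height": 0.03}},
        },
        {
            "BlockType": "LINE",
            "Text": "This is the first line",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.10, "Width": 0.40, "Height": 0.03}},
        },
    ]
}

for line in extract_lines(sample_response):
    print(line["Top"], line["Text"])
```

The records come back in reading order regardless of the order of the blocks in the response, which makes a later line-by-line grouping pass simpler.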

Hope this helps!

Answer 2

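As syumaK's answer states, the Textract API does not support this. Consider an alternative service such as the Google Vision API, which will usually give you whole paragraphs rather than just lines.

Alternatively, consider how text is usually laid out on a page. Lines that belong to the same paragraph tend to have similar widths and similar heights. Depending on the alignment used, they will share a similar left, center, or right x position, and the line spacing in the y direction is usually less than 2x the line height. You can limit your search to a single page at a time. You may also benefit from building a spatial search index, such as an r-tree, to speed up searches within a page.

Sorry, no code, but this should form a good framework for building a function that aggregates lines into blocks.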
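The layout heuristic described above can be sketched in Python roughly as follows. This is a simplified illustration, not a definitive implementation: it assumes left-aligned text on a single page, measures line spacing as the distance between the tops of consecutive lines, and expects each line as a dict holding its `Text` plus the `Left`, `Top`, and `Height` values from the block's `Geometry.BoundingBox`. The function name, the parameter names, and the `left_tolerance` default are hypothetical choices made for this example, and it uses a plain sorted scan rather than an r-tree.

```python
from typing import Any, Dict, List


def group_lines_into_paragraphs(
    lines: List[Dict[str, Any]],
    max_spacing_factor: float = 2.0,  # "less than 2x the line height"
    left_tolerance: float = 0.02,     # assumed tolerance for "similar left x position"
) -> List[str]:
    """Merge consecutive lines into paragraphs using spacing and left alignment."""
    paragraphs: List[List[Dict[str, Any]]] = []
    # Walk the lines in reading order: top to bottom, then left to right.
    for line in sorted(lines, key=lambda l: (l["Top"], l["Left"])):
        if paragraphs:
            prev = paragraphs[-1][-1]
            spacing = line["Top"] - prev["Top"]
            same_column = abs(line["Left"] - prev["Left"]) <= left_tolerance
            close_enough = spacing < max_spacing_factor * prev["Height"]
            if same_column and close_enough:
                paragraphs[-1].append(line)
                continue
        # Otherwise this line starts a new paragraph.
        paragraphs.append([line])
    return [" ".join(l["Text"] for l in paragraph) for paragraph in paragraphs]


# Made-up coordinates for the three lines from the question.
sample_lines = [
    {"Text": "This is the first line", "Left": 0.1, "Top": 0.10, "Height": 0.03},
    {"Text": "should continue here.", "Left": 0.1, "Top": 0.14, "Height": 0.03},
    {"Text": "This is the second line.", "Left": 0.1, "Top": 0.22, "Height": 0.03},
]

print(group_lines_into_paragraphs(sample_lines))
# ['This is the first line should continue here.', 'This is the second line.']
```

Centered or right-aligned text would need the alignment check to compare line centers or right edges instead, and multi-column pages would need the lines split by column before grouping.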
