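Modify the content of an ingested PDF

Date: 2017-04-03 09:06:02

Tags: elasticsearch

I created an ingest pipeline in Elasticsearch to index documents that carry an array of PDFs. I want to modify the content field so that other fields are appended to its end for searching.

My pipeline is: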

client.ingest.putPipeline({
  id: 'my-pipeline-id',
  body: {
    "description" : "Extract attachment information",
    "processors" : 
    [
      {
        "foreach": {
          "field": "attachments",
          "processor": {
            "attachment": {
              "target_field": "_ingest._value.attachment",
              "field": "_ingest._value.data"
            }
          }
        }
      }
    ]
  }
}, callback);

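I can't simply add a set processor after the foreach, because I need access to each PDF's content in order to append the document's values to the end of that content.

An example document: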
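One possible approach, shown only as an untested sketch: add a second foreach whose inner processor is a set processor, so that the set runs once per attachment and can read that attachment's content through `_ingest._value`. The helper name `buildPipelineBody` is hypothetical, and the sketch assumes that the set processor's `value` accepts Mustache templates referencing both `_ingest._value` and top-level fields in the Elasticsearch version in use.

```javascript
// Hypothetical helper (not part of elasticsearch-js): builds a pipeline body
// with a second foreach that appends top-level fields to each attachment's
// extracted content.
function buildPipelineBody(extraFields) {
  // Mustache template: the existing content followed by each extra field.
  // Assumption: templates may reference _ingest._value inside a foreach.
  const template = ['_ingest._value.attachment.content']
    .concat(extraFields)
    .map((name) => '{{' + name + '}}')
    .join(' ');

  return {
    description: 'Extract attachment information and append extra fields',
    processors: [
      {
        // Same extraction step as in the original pipeline.
        foreach: {
          field: 'attachments',
          processor: {
            attachment: {
              target_field: '_ingest._value.attachment',
              field: '_ingest._value.data'
            }
          }
        }
      },
      {
        // Runs once per attachment, after extraction has filled in content.
        foreach: {
          field: 'attachments',
          processor: {
            set: {
              field: '_ingest._value.attachment.content',
              value: template
            }
          }
        }
      }
    ]
  };
}

const pipelineBody = buildPipelineBody(['bastidor', 'matricula', 'expediente']);
console.log(JSON.stringify(pipelineBody, null, 2));
```

The resulting object could be passed as the `body` of `client.ingest.putPipeline`, in place of the literal body in the pipeline above. Whether the template resolves as intended should be checked with the simulate API before relying on it.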

let doc = {
    matricula: '6789AAA',
    bastidor: 'BASTIDOR789',
    expediente: '79',
    attachments:
    [
        {
            filename: "informe",
            data: /* chunk of data in base64 */
        },
        {
            filename: "ivtm_diba",
            data: /* another chunk of data in base64 */
        }
    ]
};
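For context, a document like this is sent through the pipeline by naming the pipeline in the index request. The sketch below only builds the request parameters. The helper name `buildIndexRequest` is hypothetical, the index and type names are taken from the result shown in this question, and the base64 strings are placeholders rather than real PDF data.

```javascript
// Hypothetical helper: builds the parameters for an index call that routes
// the document through the ingest pipeline defined above.
function buildIndexRequest(doc) {
  return {
    index: 'doc',                 // index name as it appears in the result
    type: 'document',             // type name as it appears in the result
    pipeline: 'my-pipeline-id',   // id used in putPipeline
    body: doc
  };
}

// Placeholder payloads: these are NOT real PDFs, only stand-ins for base64 data.
const placeholder = Buffer.from('placeholder').toString('base64');

const sampleDoc = {
  matricula: '6789AAA',
  bastidor: 'BASTIDOR789',
  expediente: '79',
  attachments: [
    { filename: 'informe', data: placeholder },
    { filename: 'ivtm_diba', data: placeholder }
  ]
};

const indexRequest = buildIndexRequest(sampleDoc);
console.log(indexRequest.pipeline);
```

With the legacy elasticsearch-js client, the object would be passed as `client.index(indexRequest, callback)`.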

The resulting document looks like this:

{
    "_index": "doc",
    "_type": "document",
    "_id": "AVsy85rwMuPe74hQBT8L",
    "_score": 1.2039728,
    "_source": {
      "attachments": [
        {
          "filename": "informe",
          "attachment": {
            "content": "Very very long content",
            "date": "2016-06-08T14:01:25Z",
            "content_type": "application/pdf",
            "language": "es",
            "content_length": 3124
          }
        },
        {
          "filename": "ivtm_diba",
          "attachment": {
            "content": "Very long content here",
            "content_type": "application/pdf",
            "language": "ca",
            "content_length": 5657
          }
        }
      ],
      "expediente": "79",
      "matricula": "6789ZXC",
      "bastidor": "BASTIDOR789"
    }
  }

I want to append the values of the "bastidor", "matricula" and "expediente" fields to the content field.

I'm using elasticsearch-js, but that is not a requirement.

1 Answer:

Answer 0: (score: 1)

The Elasticsearch _all field can be used in most cases.
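As a rough illustration of this suggestion, a match query against _all searches the extracted attachment content together with the other fields, without modifying the content field. This is a sketch that assumes _all is enabled on the index (which, as far as I know, is the default in 5.x but not in later major versions). The helper name `buildAllFieldSearch` is hypothetical.

```javascript
// Hypothetical helper: builds search parameters that query the _all field,
// which combines the values of the document's fields into one searchable field.
// Assumption: _all is enabled for the target index.
function buildAllFieldSearch(text) {
  return {
    index: 'doc',
    body: {
      query: {
        match: { _all: text }
      }
    }
  };
}

const searchParams = buildAllFieldSearch('BASTIDOR789');
console.log(JSON.stringify(searchParams));
```

With the legacy elasticsearch-js client, this would be passed as `client.search(searchParams, callback)`.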