Question

Below is my code for indexing a PDF from a URL in Elasticsearch:

import base64

import requests
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Ingest pipeline that runs the attachment processor on the "data" field
body = {
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "data"
            }
        }
    ]
}
# Register the pipeline (sends PUT _ingest/pipeline/attachment)
es.index(index='_ingest', doc_type='pipeline', id='attachment', body=body)

# Download the PDF and base64-encode it for the attachment processor
url = 'https://pubs.vmware.com/nsx-63/topic/com.vmware.ICbase/PDF/nsx_63_cross_vc_install.pdf'
response = requests.get(url)
data = base64.b64encode(response.content).decode('ascii')

# Index the document through the pipeline, then fetch it without the raw data
result2 = es.index(index='my_index', doc_type='my_type', pipeline='attachment',
                   body={'data': data})
result2
doc = es.get(index='my_index', doc_type='my_type', id=result2['_id'],
             _source_exclude=['data'])
doc
print(doc['_source']['attachment']['content'])

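The last line prints only part of the PDF's content, not all 63 pages. Do I need to change a setting somewhere? (I already tried increasing the console output limit; it didn't help.)

Please advise.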

Answer 1

Extraction is limited to 100000 characters by default. You can change that limit by setting indexed_chars.

Change it in the pipeline definition.

See https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
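
A minimal sketch of that change, reusing the pipeline id and field name from the question. The helper names build_attachment_pipeline and register_pipeline are made up for illustration, and the call assumes an elasticsearch-py client version that accepts es.ingest.put_pipeline(id=..., body=...). Setting indexed_chars to -1 removes the limit; a positive number raises it to that many characters.

```python
def build_attachment_pipeline(indexed_chars=-1):
    """Return an ingest pipeline body for the attachment processor.

    indexed_chars caps how many characters are extracted per document;
    -1 removes the cap (the default cap is 100000).
    """
    return {
        "description": "Extract attachment information",
        "processors": [
            {
                "attachment": {
                    "field": "data",
                    "indexed_chars": indexed_chars,
                }
            }
        ],
    }


def register_pipeline(es, pipeline_id="attachment", indexed_chars=-1):
    """Create or overwrite the pipeline on the cluster.

    es is assumed to be an elasticsearch.Elasticsearch client instance.
    """
    return es.ingest.put_pipeline(
        id=pipeline_id,
        body=build_attachment_pipeline(indexed_chars),
    )

# Usage (needs a running cluster with the ingest-attachment plugin):
#   from elasticsearch import Elasticsearch
#   register_pipeline(Elasticsearch())
```

Documents indexed before the pipeline was updated keep their truncated content, so the PDF has to be indexed again through the pipeline for the full text to appear.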
