ElasticSearch and Python: issue with the search function

Date: 2018-10-22 12:22:47

Tags: python django elasticsearch

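I'm trying to use ElasticSearch 6.4 with an existing web application written in Python/Django. I've run into some issues, and I would like to understand why they happen and how to solve them.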

###########

#Existing:#

###########

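In my application, it is possible to upload document files (.pdf or .doc, for example). The application also has a search function that searches the uploaded documents, which are indexed by ElasticSearch.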

Document titles are always written the same way:

YEAR - DOC_TYPE - ORGANISATION - document_title.extension

For example:

1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf

The search is always performed among documents with doc_type = ANNUAL_REPORT, because several doc_types exist (ANNUAL_REPORT, OTHERS, etc.).
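
Because the title encodes the year, the document type and the organisation, those parts could be extracted before indexing and stored as separate fields. The sketch below is only an illustration. It follows the underscore-separated example rather than the dash-separated pattern, `parse_title` and `KNOWN_DOC_TYPES` are hypothetical names, and the trailing number (1342 in the example) is kept as an opaque suffix because the post does not say what it means.

```python
import re

# Document types named in the post; the real list is longer ("etc.").
KNOWN_DOC_TYPES = ("ANNUAL_REPORT", "OTHERS")


def parse_title(title, doc_types=KNOWN_DOC_TYPES):
    """Split '1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf' into parts.

    Returns a dict, or None when the title does not follow the pattern.
    """
    prefix, sep, name = title.partition(" - ")
    if not sep:
        return None

    match = re.match(r"^(\d{4})_(.+)$", prefix)
    if not match:
        return None
    year, rest = match.groups()

    # The document type itself contains underscores, so match it by name.
    for doc_type in doc_types:
        if rest.startswith(doc_type + "_"):
            organisation, _, suffix = rest[len(doc_type) + 1:].rpartition("_")
            return {
                "year": int(year),
                "doc_type": doc_type,
                "organisation": organisation,
                "suffix": suffix,
                "name": name,
            }
    return None
```

Indexing `year` as its own field would make a search for 1970 independent of how the title string is tokenized.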

##################

#My environment:#

##################

Here is some data from my ElasticSearch setup. I'm also still learning the ES commands.

$ curl -XGET http://127.0.0.1:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   app  5T0HZTbmQU2-ZNJXlNb-zg   5   1        742            2    396.4kb        396.4kb

So my index is app.

For the example above, if I look up this document: 1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf, I get:

$ curl -XGET http://127.0.0.1:9200/app/annual-report/1343?pretty
{
  "_index" : "app",
  "_type" : "annual-report",
  "_id" : "1343",
  "_version" : 33,
  "found" : true,
  "_source" : {
    "attachment" : {
      "date" : "2010-03-04T12:08:00Z",
      "content_type" : "application/pdf",
      "author" : "manshanden",
      "language" : "et",
      "title" : "Microsoft Word - Test document Word.doc",
      "content" : "some text ...",
      "content_length" : 3926
    },
    "relative_path" : "app_docs/APP-TEST/1970_ANNUAL_REPORT_APP-TEST_1342.pdf",
    "title" : "1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf"
  }
}

Now, in the search part of my web application, I expect to find this document with the search term: 1970

def search_in_annual(self, q):
    try:
        response = self.es.search(
            index='app', doc_type='annual-report',
            q=q, _source_exclude=['data'], size=5000)
    except ConnectionError:
        return -1, None

    total = 0
    hits = []
    if response:
        for hit in response["hits"]["hits"]:
            hits.append({
                'id': hit['_id'],
                'title': hit['_source']['title'],
                'file': hit['_source']['relative_path'],
            })

        total = response["hits"]["total"]

    return total, hits

But when q=1970, the result is 0.
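
One plausible explanation, offered as an assumption rather than something the post confirms: if `title` is a dynamically mapped text field using the default standard analyzer, underscores do not split words, so `1970` never becomes a token of its own. The pure-Python function below only approximates that behaviour with a regular expression (`approx_standard_tokens` is a hypothetical helper, not part of any library); the `_analyze` API on the real index is the authoritative check.

```python
import re


def approx_standard_tokens(text):
    """Rough approximation of how the standard analyzer splits a title.

    Letters, digits and underscores stay together, a dot or apostrophe
    between two word characters does not split, and tokens are lowercased.
    This is a simplification, not the real Unicode segmentation algorithm.
    """
    return [tok.lower() for tok in re.findall(r"\w+(?:[.']\w+)*", text)]


if __name__ == "__main__":
    title = "1970_ANNUAL_REPORT_APP-TEST_1342 - loremipsum.pdf"
    tokens = approx_standard_tokens(title)
    print(tokens)
    # A plain term search for "1970" needs an exact token match.
    print("1970" in tokens)
```

Under this approximation the title yields the tokens `1970_annual_report_app`, `test_1342` and `loremipsum.pdf`, which would explain why an exact search for `1970` finds nothing while a prefix search does.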

If I write:

response = self.es.search(
                index='app', doc_type='annual-report',
                q="q*", _source_exclude=['data'], size=5000)

It returns my document, but also many other documents that have 1970 in neither the title nor the content.
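
A wildcard passed through the `q` parameter is not tied to a particular field, which may be why unrelated documents come back. A possible alternative is a `query_string` request body restricted to `title`. This is a minimal sketch: `build_title_prefix_query` and `escape_query_string` are hypothetical helpers, and the field list is an assumption based on the `_source` shown above.

```python
import re

# Characters with a special meaning in query_string syntax.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')


def escape_query_string(term):
    """Backslash-escape reserved query_string characters in user input."""
    return _RESERVED.sub(r"\\\1", term)


def build_title_prefix_query(term, fields=("title",)):
    """Build a search body matching tokens that start with `term`."""
    return {
        "query": {
            "query_string": {
                "query": escape_query_string(term) + "*",
                "fields": list(fields),
            }
        }
    }


# Possible use inside search_in_annual (not executed here):
#   response = self.es.search(
#       index='app', doc_type='annual-report',
#       body=build_title_prefix_query(q),
#       _source_exclude=['data'], size=5000)
```

Adding `"attachment.content"` to `fields` would extend the same prefix search to the extracted file text.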

#################

#My global code:#

#################

Here is the global class that manages the indexing functions:

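from base64 import b64encode

from elasticsearch import Elasticsearch, helpers
from elasticsearch.helpers import BulkIndexError


class EdqmES(object):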
    host = 'localhost'
    port = 9200
    es = None

    def __init__(self, *args, **kwargs):
        self.host = kwargs.pop('host', self.host)
        self.port = kwargs.pop('port', self.port)

        # Connect to ElasticSearch server
        self.es = Elasticsearch([{
            'host': self.host,
            'port': self.port
        }])

    def __str__(self):
        # port is an int, so format it rather than concatenating str + int
        return '{}:{}'.format(self.host, self.port)

    @staticmethod
    def file_encode(filename):
        with open(filename, "rb") as f:
            return b64encode(f.read()).decode('utf-8')

    def create_pipeline(self):
        body = {
            "description": "Extract attachment information",
            "processors": [
                {"attachment": {
                    "field": "data",
                    "target_field": "attachment",
                    "indexed_chars": -1
                }},
                {"remove": {"field": "data"}}
            ]
        }
        # Register the pipeline through the ingest API
        # (PUT _ingest/pipeline/attachment)
        self.es.ingest.put_pipeline(id='attachment', body=body)

    def index_document(self, doc, bulk=False):
        filename = doc.get_filename()

        try:
            data = self.file_encode(filename)
        except IOError:
            data = ''
            print('ERROR with ' + filename)
            # TODO: log error

        item_body = {
            '_id': doc.id,
            'data': data,
            'relative_path': str(doc.file),
            'title': doc.title,
        }

        if bulk:
            return item_body

        result1 = self.es.index(
            index='app', doc_type='annual-report',
            id=doc.id,
            pipeline='attachment',
            body=item_body,
            request_timeout=60
        )
        print(result1)
        return result1

    def index_annual_reports(self):
        list_docs = Document.objects.filter(category=Document.OPT_ANNUAL)

        print(list_docs.count())
        self.create_pipeline()

        bulk = []
        inserted = 0
        for doc in list_docs:
            inserted += 1
            bulk.append(self.index_document(doc, True))

            if inserted == 20:
                inserted = 0
                try:
                    print(helpers.bulk(self.es, bulk, index='app',
                                       doc_type='annual-report',
                                       pipeline='attachment',
                                       request_timeout=60))
                except BulkIndexError as err:
                    print(err)
                bulk = []

        if inserted:
            print(helpers.bulk(
                self.es, bulk, index='app',
                doc_type='annual-report',
                pipeline='attachment', request_timeout=60))
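
The manual counter in `index_annual_reports` sends documents in groups of 20. The same batching can be written as a small generator, sketched below; `chunked` is a hypothetical helper, not part of elasticsearch-py. As an alternative, `helpers.bulk` forwards a `chunk_size` keyword to the underlying streaming helper, which may make manual batching unnecessary.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Possible use inside index_annual_reports (not executed here):
#   actions = (self.index_document(doc, True) for doc in list_docs)
#   for batch in chunked(actions, 20):
#       helpers.bulk(self.es, batch, index='app',
#                    doc_type='annual-report',
#                    pipeline='attachment', request_timeout=60)
```

Because `chunked` consumes its input lazily, the base64-encoded files are only held in memory one batch at a time.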

My documents are indexed when they are submitted, thanks to a Django signal:

@receiver(signals.post_save, sender=Document, dispatch_uid='add_new_doc')
def add_document_handler(sender, instance=None, created=False, **kwargs):
    """ When a document is created index new annual report (only) with Elasticsearch and update conformity date if the
    document is a new declaration of conformity

    :param sender: Class which is concerned
    :type sender: the model class
    :param instance: Object which was just saved
    :type instance: model instance
    :param created: True for a creation, False for an update
    :type created: boolean
    :param kwargs: Additional parameter of the signal
    :type kwargs: dict
    """

    if not created:
        return

    # Index only annual reports
    elif instance.category == Document.OPT_ANNUAL:
        es = EdqmES()
        es.index_document(instance)

1 Answer:

Answer 0 (score: 0)

Here is what I did, and it seems to work.


It searches the title, a prefix, and the content to find my document.
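
A query of the kind described (title, prefix and content) could be expressed as a `bool` query with three `should` clauses. The sketch below is an assumption about what such a query might look like, not the answerer's actual code; `build_search_body` is a hypothetical name, and the field names come from the `_source` shown in the question.

```python
def build_search_body(q):
    """Build a search body that matches on title, title prefix, or content."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Analyzed full-text match on the title.
                    {"match": {"title": q}},
                    # Prefix queries are not analyzed, so lowercase the
                    # input to line up with lowercased index tokens.
                    {"prefix": {"title": q.lower()}},
                    # Analyzed match on the text extracted by the
                    # attachment pipeline.
                    {"match": {"attachment.content": q}},
                ],
                # At least one of the clauses must match.
                "minimum_should_match": 1,
            }
        }
    }


# Possible use inside search_in_annual (not executed here):
#   response = self.es.search(
#       index='app', doc_type='annual-report',
#       body=build_search_body(q),
#       _source_exclude=['data'], size=5000)
```

With this body, a search for 1970 would match the example document through the prefix clause even if `1970` is not a standalone token in the title.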