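How do I convert a Python getter method into an Elasticsearch query?

Time: 2019-04-12 09:47:56

Tags: python json elasticsearch

I want to translate a Python method that gets specific terms from a scraped website into an Elasticsearch query.

I am doing an internship that involves web scraping and Elasticsearch (among other things), and I am new to this field (and to programming in general).

My task is to scrape country codes and then write a query that returns one country code given another, for example:

The 2-character country code for Australia is "AU". Its 3-character country code is "AUS".

So, by specifying "AU", I want to get back the "AUS" code.

To do this, I scraped the list of codes for every country and wrote Python code that returns this result.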

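So basically I want to convert that code into a request, and then implement it on a web page for internal use.

I am a beginner, so please be as explicit as possible.

1 answer:

Answer 0 (score: 0)

Assuming the documents were indexed with the default dynamic mapping, every string field should be mapped both as a text type and as a keyword type. A simple term query on the keyword mapping will therefore give you the result you want.

For example, creating the index with default settings is as simple as: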

PUT countries-codes

Indexing the document you provided would look like this:

POST countries-codes/event
{
  "name": "Albanie",
  "alpha_2": "AL",
  "alpha_3": "ALB",
  "num": "8"
}

Now we can look at the index mapping to see how Elasticsearch maps the fields internally:

GET countries-codes/_mapping

Result:

{
  "countries-codes" : {
    "mappings" : {
      "event" : {
        "properties" : {
          "alpha_2" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "alpha_3" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "num" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

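Now we simply run a term query against the keyword mapping of the 2-character country code, and we get back the document representing the match (or, if there are somehow multiple matches, all of the documents representing those matches):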

GET countries-codes/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "alpha_2.keyword": "AL"
        }
      }
    }
  }
}
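
In Python, the same request body can be written as a plain dictionary. This is only a minimal sketch: the helper name build_alpha2_filter is hypothetical, and it assumes the default dynamic mapping shown above, so that the alpha_2.keyword sub-field exists. Upper-casing the input is also an assumption, based on the codes being indexed in upper case and keyword fields matching exactly.

```python
import json


def build_alpha2_filter(alpha_2):
    """Build a search body that matches one 2-character country code exactly.

    Hypothetical helper: the field name comes from the mapping shown above.
    """
    return {
        "query": {
            "bool": {
                # Filter context: an exact match, with no relevance scoring.
                "filter": {
                    # keyword fields are case-sensitive, so normalise the input
                    "term": {"alpha_2.keyword": alpha_2.upper()}
                }
            }
        }
    }


if __name__ == "__main__":
    # Prints the same JSON body as the Dev Tools request above.
    print(json.dumps(build_alpha2_filter("al"), indent=2))
```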

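Note that this is a filtered query, since you are not interested in scoring. In short, filter context is faster than query context because it skips relevance scoring, so use it whenever possible. For more information, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html

This returns the document you posted earlier, inside the returned hits array: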

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "countries-codes",
        "_type" : "event",
        "_id" : "qGDmEWoBqkB-aMRpdfvt",
        "_score" : 0.0,
        "_source" : {
          "name" : "Albanie",
          "alpha_2" : "AL",
          "alpha_3" : "ALB",
          "num" : "8"
        }
      }
    ]
  }
}
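
On the Python side, the 3-character code can be pulled out of a response shaped like the one above. This is a sketch only: extract_alpha3 is a hypothetical helper, and the sample dictionary is a trimmed copy of the response shown above rather than a live result.

```python
def extract_alpha3(response):
    """Return the alpha_3 value of every hit in a search response dict."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        hit["_source"]["alpha_3"]
        for hit in hits
        if "alpha_3" in hit.get("_source", {})
    ]


# Trimmed copy of the response shown above.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "countries-codes",
                "_source": {
                    "name": "Albanie",
                    "alpha_2": "AL",
                    "alpha_3": "ALB",
                    "num": "8",
                },
            }
        ],
    }
}

if __name__ == "__main__":
    print(extract_alpha3(sample_response))  # ['ALB']
```

Returning a list rather than a single value keeps the multiple-match case visible to the caller.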

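Any submitted code that does not match produces an empty hits array. On the client side, you can then parse out just the element you need. If your documents are very large, or you are returning a large number of documents, look into source filtering: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-source-filtering.html

For example: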

GET countries-codes/_search
{
  "_source": "alpha_3", 
  "query": {
    "bool": {
      "filter": {
        "term": {
          "alpha_2.keyword": "AL"
        }
      }
    }
  }
}

In the returned hits object, note that only the requested field is returned from the document:

"hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "countries-codes",
        "_type" : "event",
        "_id" : "qGDmEWoBqkB-aMRpdfvt",
        "_score" : 0.0,
        "_source" : {
          "alpha_3" : "ALB"
        }
      }
    ]
  }

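All examples are shown using Dev Tools / plain API calls. Since you are using Python, take a look at the officially maintained Elasticsearch libraries:

Elasticsearch DSL, built on top of the lower-level Elasticsearch-Py: https://elasticsearch-dsl.readthedocs.io/en/latest/

Elasticsearch-Py: https://elasticsearch-py.readthedocs.io/en/master/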
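
As a rough sketch of how the lookup might be wrapped with Elasticsearch-Py: lookup_alpha3 is a hypothetical helper, the server address in the usage comment is an assumption, and the body argument of search() is the form accepted by the 6.x and 7.x clients (newer client versions prefer passing query and source options as keyword arguments, so check the documentation for your version).

```python
def lookup_alpha3(client, alpha_2, index="countries-codes"):
    """Return the 3-character code for a 2-character code, or None.

    `client` is expected to be an elasticsearch.Elasticsearch instance
    (anything with a compatible search() method works).
    """
    body = {
        "_source": "alpha_3",  # source filtering: return only this field
        "query": {
            "bool": {
                "filter": {"term": {"alpha_2.keyword": alpha_2}}
            }
        },
    }
    response = client.search(index=index, body=body)
    hits = response["hits"]["hits"]
    # An unmatched code yields an empty hits array.
    return hits[0]["_source"]["alpha_3"] if hits else None


# Usage (requires a running cluster; the address is an assumption):
# from elasticsearch import Elasticsearch
# client = Elasticsearch("http://localhost:9200")
# lookup_alpha3(client, "AL")
```

Taking the client as a parameter keeps the function easy to test with a stand-in object before pointing it at a real cluster.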