How do I customize a synonym token filter with elasticsearch-dsl in Python?

Asked: 2020-05-04 01:30:26

Tags: python elasticsearch elasticsearch-dsl

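I am trying to build a synonym token filter with elasticsearch-dsl in Python so that, for example, when I search for "tiny" or "little", it also returns articles that contain "small". Here is my code: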

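from elasticsearch import Elasticsearch
from elasticsearch_dsl import analyzer, connections, token_filter

# Connect to the local Elasticsearch server
connections.create_connection(hosts=['127.0.0.1'])

# Synonym filter that reads its rules from a file under the Elasticsearch config directory
spelling_tokenfilter = token_filter(
    'my_tokenfilter',  # Name for the filter
    'synonym',  # Synonym filter type
    synonyms_path="analysis/wn_s.pl"
)

# Create elasticsearch object
es = Elasticsearch()

# Custom analyzer: standard tokenizer, then lowercase, stop and the synonym filter
text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=['lowercase', 'stop', spelling_tokenfilter])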

I created a folder named "analysis" in es-7.6.2/config, downloaded the WordNet Prolog database, and copied "wn_s.pl" into it. But when I run the program, I get this error:

Traceback (most recent call last):
  File "index.py", line 161, in <module>
    main()
  File "index.py", line 156, in main
    buildIndex()
  File "index.py", line 74, in buildIndex
    covid_index.create()
  File "C:\Anaconda\lib\site-packages\elasticsearch_dsl\index.py", line 259, in create
    return self._get_connection(using).indices.create(index=self._name, body=self.to_dict(), **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\utils.py", line 92, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "C:\Anaconda\lib\site-packages\elasticsearch\client\indices.py", line 104, in create
    "PUT", _make_path(index), params=params, headers=headers, body=body
  File "C:\Anaconda\lib\site-packages\elasticsearch\transport.py", line 362, in perform_request
    timeout=timeout,
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 248, in perform_request
    self._raise_error(response.status, raw_data)
  File "C:\Anaconda\lib\site-packages\elasticsearch\connection\base.py", line 244, in _raise_error
    status_code, error_message, additional_info
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'failed to build synonyms')

Does anyone know how to fix this? Thanks!

1 Answer:

Answer 0 (score: 1)

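This happens because you defined the lowercase and stop token filters before the synonym filter (docs):

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries.

First, let's try to get more details about the error by catching the exception: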

>>> text_analyzer = analyzer('my_tokenfilter',
...                          type='custom',
...                          tokenizer='standard',
...                          filter=[
...                              'lowercase', 'stop',
...                              spelling_tokenfilter
...                              ])
>>>
>>> try:
...   text_analyzer.simulate('blah blah')
... except Exception as e:
...   ex = e
...
>>> ex
RequestError(400, 'illegal_argument_exception', {'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms'}], 'type': 'illegal_argument_exception', 'reason': 'failed to build synonyms', 'caused_by': {'type': 'parse_exception', 'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}, 'status': 400})

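This part is particularly interesting:

'reason': 'Invalid synonym rule at line 109', 'caused_by': {'type': 'illegal_argument_exception', 'reason': 'term: course of action analyzed to a token (action) with position increment != 1 (got: 2)'}}}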
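The same fragment can be pulled out programmatically instead of being read off the raw repr. The following is only a minimal sketch: `cause_chain` is a hypothetical helper, not part of elasticsearch-py, and it assumes the error payload has the nested `caused_by` shape shown above (in elasticsearch-py 7.x that dict should be available as `ex.info`). The payload is copied into the snippet so that it runs without a cluster.

```python
def cause_chain(error_info):
    """Collect the 'reason' strings from the outermost error down to the root cause.

    Hypothetical helper: it only assumes each level is a dict that may carry
    a 'reason' key and a nested 'caused_by' dict.
    """
    reasons = []
    node = error_info.get("error", {})
    while isinstance(node, dict):
        if "reason" in node:
            reasons.append(node["reason"])
        node = node.get("caused_by")  # None ends the loop at the root cause
    return reasons


# Payload copied from the RequestError shown above (with a live cluster: ex.info)
info = {
    "error": {
        "root_cause": [
            {"type": "illegal_argument_exception", "reason": "failed to build synonyms"}
        ],
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms",
        "caused_by": {
            "type": "parse_exception",
            "reason": "Invalid synonym rule at line 109",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "term: course of action analyzed to a token (action) "
                          "with position increment != 1 (got: 2)",
            },
        },
    },
    "status": 400,
}

# Print one reason per line, indented by nesting depth
for depth, reason in enumerate(cause_chain(info)):
    print("  " * depth + reason)
```

The last line printed is the root cause, which is the one that names the offending synonym entry.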

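This suggests that Elasticsearch managed to find the file but failed to parse it: the stop filter removes "of" from the entry "course of action", which leaves a gap before "action" (position increment 2) that the synonym parser rejects.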
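To make the position increment in that message concrete, here is a small pure-Python sketch. It is not how Lucene is implemented; `analyze` and `STOP_WORDS` are made up for illustration, and the stop list is only a small subset chosen for this example. It mimics lowercasing, whitespace tokenization and stop-word removal, and records how many positions each surviving token advanced.

```python
# Illustrative subset only; the real default stop list in Elasticsearch is longer.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def analyze(text, stop_words=STOP_WORDS):
    """Return (token, position_increment) pairs after lowercasing and stop-word removal."""
    tokens = []
    increment = 1
    for word in text.lower().split():
        if word in stop_words:
            # A removed token still occupies a position, so the next
            # surviving token has to jump over it.
            increment += 1
            continue
        tokens.append((word, increment))
        increment = 1
    return tokens


print(analyze("course of action"))  # [('course', 1), ('action', 2)]
print(analyze("blah blah"))         # [('blah', 1), ('blah', 1)]
```

The entry "course of action" ends up with a token whose increment is 2, which matches the "position increment != 1 (got: 2)" complaint, while text without stop words keeps every increment at 1.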

Finally, if you remove the two token filters, the error goes away:

text_analyzer = analyzer('my_tokenfilter',
                         type='custom',
                         tokenizer='standard',
                         filter=[
                             #'lowercase', 'stop',
                             spelling_tokenfilter
                             ])
...
>>> text_analyzer.simulate("blah")
{'tokens': [{'token': 'blah', 'start_offset': 0, 'end_offset...}

The docs suggest using the multiplexer token filter in case you need to combine them.
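As a rough idea of what that could look like, the sketch below builds index settings as a plain Python dict, so it runs without a cluster. The names `my_synonyms`, `my_multiplexer` and `my_analyzer` are made up for this example, the synonym filter options are kept exactly as in the question, and the layout has not been tested against a live Elasticsearch node, so treat it as a starting point rather than a verified fix. Each entry in `filters` is one branch, written as a comma-separated chain of filter names; the stop filter sits in a different branch from the synonym filter, so it is not applied to the synonym file entries.

```python
import json

# Hypothetical index settings; all "my_*" names are illustrative.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/wn_s.pl",  # same path as in the question
                },
                "my_multiplexer": {
                    "type": "multiplexer",
                    # Branch 1: lowercase + stop; branch 2: lowercase + synonyms
                    "filters": ["lowercase, stop", "lowercase, my_synonyms"],
                },
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["my_multiplexer"],
                }
            },
        }
    }
}

# Show the JSON body that would be sent when creating the index
print(json.dumps(index_settings, indent=2))
```

A dict like this could be passed as the body of an index creation request, for example through `es.indices.create(index=..., body=index_settings)` in elasticsearch-py 7.x.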

Hope this helps!