Shingles in Elasticsearch that respect punctuation

Date: 2015-06-15 08:25:21

Tags: elasticsearch lucene

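I am building an address matching engine for UK addresses in Elasticsearch and have found shingles very useful, but I am running into a problem with punctuation. The query "4 Walmley Close" returns the following matches: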

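  1. Units 3 and 4, Walmley Chambers, 3 Walmley Close
  2. Flat 4, Walmley Court, 10 Walmley Close
  3. Co-operative Retail Services Ltd, 4 Walmley Close

The true match is number 3, but 1 and 2 both match (incorrectly) because each of them yields '4 walmley' when turned into shingles. I would like to tell the shingle analyzer not to produce shingles that span a comma. So, for address 1, I currently get: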

    • units 3
    • 3 and
    • and 4
    • 4 walmley
    • walmley chamber
    • chamber 3
    • 3 walmley
    • walmley close

...when actually all I want is...

    • units 3
    • 3 and
    • and 4
    • walmley chamber
    • 3 walmley
    • walmley close
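
To make the difference between the two lists concrete, here is a minimal Python sketch of two-word shingling with and without respect for commas. It is only an illustration, not Elasticsearch code: the function names are hypothetical, and splitting on `\w+` is an assumed stand-in for a tokenizer that discards punctuation.

```python
import re


def shingles(tokens, size=2):
    """Return the word n-grams (shingles) of the given size."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def shingles_ignoring_commas(text):
    # Punctuation is dropped before shingling, so shingles can span a comma.
    return shingles(re.findall(r"\w+", text.lower()))


def shingles_within_commas(text):
    # Each comma-separated segment is shingled on its own,
    # so no shingle ever spans a comma.
    result = []
    for segment in text.lower().split(","):
        result.extend(shingles(re.findall(r"\w+", segment)))
    return result


address = "Units 3 and 4, Walmley Chambers, 3 Walmley Close"
print(shingles_ignoring_commas(address))  # includes "4 walmley" and "chambers 3"
print(shingles_within_commas(address))    # omits both of them
```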

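My current settings are below. I have tried swapping the tokenizer from standard to whitespace, which helps in that it keeps the commas and could avoid the situation above (i.e. I end up with '4, walmley' as my shingle in addresses 1 and 2), but it leaves a lot of unusable shingles in my index, and with 70 million documents I need to keep the index size down.

As you can see in the index settings, I also have a street_sym filter that I would like to be able to use with my shingles. For this example, in addition to generating 'walmley close' I would like 'walmley cl'. However, when I tried to include it, the shingles I got were not very helpful!

Any advice from more experienced Elasticsearch users would be greatly appreciated. I have read the excellent book by Gormley and Tong but could not work out this particular problem.

Thanks in advance for any help you can offer.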

    "analysis": {
        "filter": {
            "shingle": {
                "type": "shingle",
                "output_unigrams": false
            },
            "street_sym": {
                "type": "synonym",
                "synonyms": [
                    "st => street",
                    "rd => road",
                    "ave => avenue",
                    "ct => court",
                    "ln => lane",
                    "terr => terrace",
                    "cir => circle",
                    "hwy => highway",
                    "pkwy => parkway",
                    "cl => close",
                    "blvd => boulevard",
                    "dr => drive",
                    "ste => suite",
                    "wy => way",
                    "tr => trail"
                ]
            }
        },
        "analyzer": {
            "shingle": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "shingle"
                ]
            }
        }
    }
    

1 Answer:

Answer 0 (score: 0)

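See my comment on your question for why this solution still will not prevent "4 Walmley Close" from matching all three of the addresses you listed. It is, however, at least possible to get the tokens you want. I am not sure it is the most elegant or performant solution, but applying the Pattern Replace, Pattern Capture, and Length filters to your shingles seems to do the trick: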

"analysis": {
    "filter": {
        "shingle": {
            "type": "shingle",
            "output_unigrams": false
        },
        "street_sym": {
            "type": "synonym",
            "synonyms": [
                "st => street",
                "rd => road",
                "ave => avenue",
                "ct => court",
                "ln => lane",
                "terr => terrace",
                "cir => circle",
                "hwy => highway",
                "pkwy => parkway",
                "cl => close",
                "blvd => boulevard",
                "dr => drive",
                "ste => suite",
                "wy => way",
                "tr => trail"
            ]
        },
        "no_middle_comma": {
            "type": "pattern_replace",
            "pattern": ".+,.+",
            "replacement": "" 
        },
        "no_trailing_comma": {
            "type": "pattern_capture",
            "preserve_original": false,
            "patterns": [
                "(.*),"
            ]
        },
        "not_empty": {
            "type": "length",
            "min": 1
        }
    },
    "analyzer": {
        "test": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
                "lowercase",
                "street_sym",
                "shingle",
                "no_middle_comma",
                "no_trailing_comma",
                "not_empty"
            ]
        }
    }
}
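  • no_middle_comma: replaces any token that has a comma in the middle with an empty token
  • no_trailing_comma: replaces any token that ends with a comma with the part before the comma
  • not_empty: removes all empty tokens produced by the filters above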
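
The filter chain above can be approximated outside Elasticsearch to check what each step does. The following is a rough Python sketch under stated assumptions, not the actual Lucene implementation: `analyze` and `SYNONYMS` are hypothetical names, only a subset of the street_sym mappings is included, and Python's `re` module stands in for the Java regular expressions that the real filters use.

```python
import re

# Assumed subset of the street_sym mappings from the settings above.
SYNONYMS = {"st": "street", "rd": "road", "ct": "court", "cl": "close"}


def analyze(text):
    # whitespace tokenizer + lowercase: commas stay attached to their word
    tokens = text.lower().split()
    # street_sym: replace an abbreviation with its full form
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    # shingle with output_unigrams=false: two-word shingles only
    shingles = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    # no_middle_comma: a shingle with text on both sides of a comma becomes ""
    shingles = [re.sub(r".+,.+", "", shingle) for shingle in shingles]

    # no_trailing_comma: keep only the part captured before the comma
    def strip_trailing_comma(shingle):
        match = re.search(r"(.*),", shingle)
        return match.group(1) if match else shingle

    shingles = [strip_trailing_comma(shingle) for shingle in shingles]
    # not_empty: drop the empty tokens left behind by no_middle_comma
    return [shingle for shingle in shingles if len(shingle) >= 1]


print(analyze("Units 3 and 4, Walmley Chambers, 3 Walmley Cl"))
# ['units 3', '3 and', 'and 4', 'walmley chambers', '3 walmley', 'walmley close']
```

Note that in this sketch a token such as `cl,` would not be expanded, because the synonym lookup runs while the comma is still attached to the word.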

For example, "Units 3 and 4, Walmley Chambers, 3 Walmley Cl" becomes:

{
   "tokens": [
      {
         "token": "units 3",
         "start_offset": 0,
         "end_offset": 7,
         "type": "shingle",
         "position": 0
      },
      {
         "token": "3 and",
         "start_offset": 6,
         "end_offset": 11,
         "type": "shingle",
         "position": 1
      },
      {
         "token": "and 4",
         "start_offset": 8,
         "end_offset": 14,
         "type": "shingle",
         "position": 2
      },
      {
         "token": "walmley chambers",
         "start_offset": 15,
         "end_offset": 32,
         "type": "shingle",
         "position": 4
      },
      {
         "token": "3 walmley",
         "start_offset": 33,
         "end_offset": 42,
         "type": "shingle",
         "position": 6
      },
      {
         "token": "walmley close",
         "start_offset": 35,
         "end_offset": 45,
         "type": "shingle",
         "position": 7
      }
   ]
}

Note that your synonym filter works: "Walmley Cl" became "walmley close".
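
As for why "4 Walmley Close" can still match all three addresses even with comma-aware shingles: the comment referred to above is not reproduced here, so the following is only one plausible reading. The Python sketch assumes the query is shingled the same way as the documents and that a document matches when it shares at least one shingle with the query; the helper name is hypothetical.

```python
def comma_aware_shingles(text):
    """Two-word shingles that never span a comma."""
    result = set()
    for segment in text.lower().split(","):
        words = segment.split()
        result.update(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
    return result


addresses = [
    "Units 3 and 4, Walmley Chambers, 3 Walmley Close",
    "Flat 4, Walmley Court, 10 Walmley Close",
    "Co-operative Retail Services Ltd, 4 Walmley Close",
]

query_shingles = comma_aware_shingles("4 Walmley Close")

# Shingles that each address has in common with the query.
shared = [sorted(comma_aware_shingles(a) & query_shingles) for a in addresses]
for address, common in zip(addresses, shared):
    print(address, "->", common)
```

Under those assumptions every address shares 'walmley close' with the query, so all three still match, but only the third also shares '4 walmley', which should let it score higher than the other two.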