Multiple regular expressions in Elasticsearch

Date: 2017-06-01 06:39:18

Tags: regex elasticsearch

This is my regex...

^ +|( +; +)| +$

Below is a screenshot of the regex with a test string

Regex and Test String

I used a screenshot to highlight the whitespace...

All I want to do is format the string like this

Trimester 1;Trimester 2;Trimester 3
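That target format amounts to two regex operations: trim the ends, then collapse any whitespace around the semicolons. A minimal Python sketch of the same normalization (illustrative only, not part of the original post):

```python
import re

raw = "   Trimester 1 ; Trimester 2 ; Trimester 3   "
# Step 1: strip leading/trailing whitespace; step 2: remove spaces around ";"
clean = re.sub(r"\s*;\s*", ";", raw.strip())
print(clean)  # → Trimester 1;Trimester 2;Trimester 3
```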

So I want to:

  1. Remove whitespace from the beginning and end of the string
  2. Remove the whitespace before and after the semicolons.
  3. Here is my custom analyzer...

    "analysis": {
                "analyzer": {
                    "semi_colon_analyzer": {
                        "tokenizer": "my_tokenizer"
                    },
                    "comma_analyzer": {
                        "type": "pattern",
                        "pattern": ",",
                        "lowercase": false
                    }
                },
                "tokenizer": {
                    "my_tokenizer": {
                        "type": "pattern",
                        "pattern": "( +; +)",
                        "replacement": "$1;"
                    }
                }
    
            }
    

This works on regex101.com, but it does not work in Elastic.

Can someone help me understand how to implement this regex in Elasticsearch?

Thanks

EDIT

Output of _analyze?analyzer=semi_colon_analyzer:
    {
      "tokens": [
        {
          "token": "Trimester",
          "start_offset": 0,
          "end_offset": 9,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "1",
          "start_offset": 10,
          "end_offset": 11,
          "type": "<NUM>",
          "position": 1
        },
        {
          "token": "Trimester",
          "start_offset": 13,
          "end_offset": 22,
          "type": "<ALPHANUM>",
          "position": 2
        },
        {
          "token": "2",
          "start_offset": 23,
          "end_offset": 24,
          "type": "<NUM>",
          "position": 3
        },
        {
          "token": "Trimester",
          "start_offset": 26,
          "end_offset": 35,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "3",
          "start_offset": 36,
          "end_offset": 37,
          "type": "<NUM>",
          "position": 5
        }
      ]
    }
    

2 answers:

Answer 0 (score: 0)

I think you need to use a char_filter. Try this:

{
    "analysis": {
        "analyzer": {
            "semi_colon_analyzer": {
                "char_filter": "my_char_filter",
                "tokenizer": "my_tokenizer",
                "filter": "trim"
            },
            "comma_analyzer": {
                "type": "pattern",
                "pattern": ",",
                "lowercase": false
            }
        },
        "char_filter": {
            "my_char_filter": {
                "type": "pattern_replace",
                "pattern": "(\\s+;\\s+)",
                "replacement": ";"
            }
        },
        "tokenizer": {
            "my_tokenizer": {
                "type": "pattern",
                "pattern": ";"
            }
        }
    }
}
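For readers who want to see what this pipeline does without running a cluster, here is a rough Python emulation of its three stages (pattern_replace char filter, pattern tokenizer, trim token filter). This is an illustrative sketch, not actual Elasticsearch code, and it does not reproduce the real offset bookkeeping:

```python
import re

def semi_colon_analyze(text):
    # char_filter "my_char_filter": replace whitespace-padded ";" with a bare ";"
    text = re.sub(r"\s+;\s+", ";", text)
    # tokenizer "my_tokenizer": split the character stream on ";"
    tokens = text.split(";")
    # token filter "trim": strip surrounding whitespace from each token
    return [t.strip() for t in tokens if t.strip()]

print(semi_colon_analyze("Trimester 1 ; Trimester 2"))  # → ['Trimester 1', 'Trimester 2']
```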

If you analyze Trimester 1 ; Trimester 2 with the analyzer created above, you will get:

{
    "tokens": [
        {
            "token": "Trimester  1",
            "start_offset": 0,
            "end_offset": 12,
            "type": "word",
            "position": 0
        },
        {
            "token": "trimester 2",
            "start_offset": 19,
            "end_offset": 33,
            "type": "word",
            "position": 1
        }
    ]
}

Answer 1 (score: 0)

I got to a solution by making a few tweaks to the original mapping...

"settings": {
    "number_of_shards": "1",
    "number_of_replicas": "0",
    "analysis": {
        "analyzer": {
            "semi_colon_analyzer": {
                "tokenizer": "my_tokenizer"
            },
            "comma_analyzer": {
                "type": "pattern",
                "pattern": ",",
                "lowercase": false
            }
        },
        "tokenizer": {
            "my_tokenizer": {
                "type": "pattern",
                 "pattern": "^ +|( *; *)| +$",
                "replacement": "$1;"
            }
        }

    }
},
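This version works because the pattern tokenizer treats every match of the pattern as a token separator, so the leading spaces, trailing spaces, and whitespace-padded semicolons are all discarded as delimiters (the replacement field appears to play no role in tokenization). The same splitting behaviour can be sketched in Python; this is an approximation of the tokenizer, not Elasticsearch's actual implementation, and the capture group is made non-capturing because re.split would otherwise return the captured separators as extra elements:

```python
import re

raw = "  Trimester 1 ; Trimester 2 ;Trimester 3  "
# Same split pattern as the tokenizer, with (...) changed to (?:...) for re.split
parts = re.split(r"^ +|(?: *; *)| +$", raw)
tokens = [t for t in parts if t]  # drop the empty strings left at the ends
print(tokens)  # → ['Trimester 1', 'Trimester 2', 'Trimester 3']
```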