Question

I have the following text:

Lurasidone is a dopamine D<sub>2</sub>

I want to tokenize it so that I get the following tokens:

Lurasidone

dopamine

D2

How can I achieve this with a tokenizer or a filter? I tried using the HTML filter, but D<sub>2</sub> gets tokenized as:

d

2

whereas I need it to be tokenized as:

D2

Answer 1

You can use a Pattern Replace Char Filter.

This is what I did:

"char_filter": {
    "html_pattern": {
        "type": "pattern_replace",
        "pattern": "<.*>(.*)<\\/.*>",
        "replacement": "$1"
    }
}
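To get a feel for what this pattern does before the tokenizer runs, here is a rough sketch in Python. It is only an approximation: Elasticsearch uses Java regular expressions (with `$1` in the replacement, where Python uses `\1`), though for a pattern this simple I would expect both to behave the same. The second sample string and the `STRICT` pattern are my own illustrations, not part of the original setup. They show that the greedy `.*` can swallow too much once a field contains more than one tag pair.

```python
import re

# The answer's pattern, written as a Python regex.
GREEDY = r"<.*>(.*)<\/.*>"
# Hypothetical stricter variant: a match can never run across a tag boundary.
STRICT = r"<[^>]+>([^<]*)</[^>]+>"


def strip_tags(pattern: str, text: str) -> str:
    """Replace every tagged span with the text captured inside it."""
    return re.sub(pattern, r"\1", text)


one_tag = "Lurasidone is a dopamine D<sub>2</sub>"
two_tags = "D<sub>2</sub> and 5-HT<sub>7</sub>"  # made-up sample with two tag pairs

print(strip_tags(GREEDY, one_tag))   # Lurasidone is a dopamine D2
print(strip_tags(GREEDY, two_tags))  # D7
print(strip_tags(STRICT, two_tags))  # D2 and 5-HT7
```

If your fields can contain several tags, something like the stricter pattern may be the safer choice for the `pattern` setting, but test it against your own data first.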

I included it in a custom analyzer like this:

"my_custom_analyzer": {
    "tokenizer": "standard",
    "char_filter": [
        "html_pattern"
    ],
    "filter": ["stop"]
}
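In case it is unclear where the two fragments go, this is a minimal sketch of how they could be combined into one index-creation body, assuming the usual `settings.analysis` layout. The `index_body` name and the `"type": "custom"` line are my additions and do not appear in the snippets above. The script only builds and prints the JSON; sending it to your cluster is left to whatever client you use.

```python
import json

# Both fragments nested under settings.analysis.
index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "html_pattern": {
                    "type": "pattern_replace",
                    "pattern": "<.*>(.*)<\\/.*>",
                    "replacement": "$1",
                }
            },
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",  # assumption: not shown in the snippets above
                    "tokenizer": "standard",
                    "char_filter": ["html_pattern"],
                    "filter": ["stop"],
                }
            },
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```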

These are the tokens generated for your text:

{
   "tokens": [
      {
         "token": "Lurasidone",
         "start_offset": 0,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "dopamine",
         "start_offset": 16,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "D2",
         "start_offset": 25,
         "end_offset": 38,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}

I hope this helps.
