I'm having trouble understanding the regexp mechanism in Elasticsearch. I have documents that represent property units:
{
"Unit" :
{
"DailyAvailablity" :
"UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
}
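The DailyAvailability field encodes the property's availability per day for the next two years, starting from today. 'A' means available, 'U' unavailable, 'I' available for check-in, and 'O' available for check-out. How can I write a regexp filter to get all units that are available on specific dates?
I tried to find a substring of 'A' characters with a specific length and offset in the DailyAvailability field. For example, to find units that are available for 7 days starting 7 days from today: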
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailability": {"value": ".{7}a{7}.*" } }
}
]
}
}
}
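This query returns, for instance, a unit whose DailyAvailability starts with "UUUUUUUUUUUUUUUUUUUUIAA" but contains a suitable sequence somewhere inside the field. How can I anchor the regexp to the whole source string? The ES docs say that Lucene regular expressions should be anchored by default.
P.S. I also tried '^.{7}a{7}.*$'. It returns an empty set.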
Answer 0 (score: 2)
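It looks like you are using the text data type to store Unit.DailyAvailability (which is also the default for strings if you rely on dynamic mapping). You should consider using the keyword data type instead.
Let me explain in more detail.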
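Why does the regexp match something in the middle of a text field?
With the text data type, the data is analyzed for full-text search. The analyzer applies transformations such as lowercasing and splitting the input into tokens.
Let's try the Analyze API on your input: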
POST _analyze
{
"text": "UIAOUUUUUUUIAAAAAAAAAAAAAAAAAOUUUUIAAAAOUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUIAAAAAOUUUUUUUUUUUUUIAAAAOUUUUUUUUUUUUUIAAAAAAAAOUUUUUUIAAAAAAAAAOUUUUUUUUUUUUUUUUUUIUUUUUUUUIUUUUUUUUUUUUUUIAAAOUUUUUUUUUUUUUIUUUUIAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}
The response is:
{
"tokens": [
{
"token": "uiaouuuuuuuiaaaaaaaaaaaaaaaaaouuuuiaaaaouuuiaouuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuiaaaaaouuuuuuuuuuuuuiaaaaouuuuuuuuuuuuuiaaaaaaaaouuuuuuiaaaaaaaaaouuuuuuuuuuuuuuuuuuiuuuuuuuuiuuuuuuuuuuuuuuiaaaouuuuuuuuuuuuuiuuuuiaouuuuuuuuuuuuuuu",
"start_offset": 0,
"end_offset": 255,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "uuuuuuuuuuuuuuiaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 255,
"end_offset": 510,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaouuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"start_offset": 510,
"end_offset": 732,
"type": "<ALPHANUM>",
"position": 2
}
]
}
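As you can see, Elasticsearch has split your input into three tokens and lowercased them. The standard tokenizer cuts tokens at 255 characters by default, which is where the offsets 255 and 510 in the response come from. This looks unexpected, but it makes sense once you consider that the analyzer is built for searching words in human language, and there are no words that long.
That is why the regexp query ".{7}a{7}.*" now matches: one of the tokens really does start with a long run of a's, and this is the expected behavior of a regexp query:
... Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.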
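To make the effect concrete, here is a rough Python sketch of what happens to a long value in an analyzed field. It is not Elasticsearch code: `analyze` and `regexp_matches` are hypothetical helpers, the 255-character chunking only mimics the token offsets seen in the Analyze API response, and Python's `re.fullmatch` stands in for Lucene's anchored matching (the two regex dialects differ in general but agree on this simple pattern). The sample value is made up.

```python
import re

MAX_TOKEN_LENGTH = 255  # assumption: default limit of the standard tokenizer


def analyze(text, max_len=MAX_TOKEN_LENGTH):
    """Rough stand-in for the analyzer on one long 'word':
    lowercase it and cut it into chunks of at most max_len characters."""
    lowered = text.lower()
    return [lowered[i:i + max_len] for i in range(0, len(lowered), max_len)]


def regexp_matches(pattern, terms):
    """Anchored matching: a term counts only if the pattern covers all of it."""
    return any(re.fullmatch(pattern, term) is not None for term in terms)


# Hypothetical unit: unavailable for 262 days, then available for 20 days.
value = "U" * 262 + "A" * 20

terms = analyze(value)
print([len(t) for t in terms])               # token lengths after splitting

# Analyzed (text) field: the second token is 7 x 'u' followed by 'a's,
# so the pattern matches even though day 7 of the real value is 'U'.
print(regexp_matches(".{7}a{7}.*", terms))

# Unanalyzed (keyword) field: the single term is the whole original string.
print(regexp_matches(".{7}A{7}.*", [value]))
```

The second token happens to begin 7 characters before the run of A's, which is exactly the kind of false positive described in the question.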
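How do I make the regexp query consider the whole string?
Very simple: do not use an analyzer. The keyword type stores the string you provide as is.
With a mapping like this: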
PUT my_regexes
{
"mappings": {
"doc": {
"properties": {
"Unit": {
"properties": {
"DailyAvailablity": {
"type": "keyword"
}
}
}
}
}
}
}
you will be able to run a query like this, which matches the document from the post:
POST my_regexes/doc/_search
{
"query": {
"bool": {
"filter": [
{
"regexp": { "Unit.DailyAvailablity": "UIAOUUUUUUUIA.*" }
}
]
}
}
}
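Note that the query is case-sensitive, because the field is not analyzed.
This regexp no longer returns any results: ".{12}a{7}.*"
This one does: ".{12}A{7}.*"
So what about anchoring? Regular expressions are anchored:
Lucene's patterns are always anchored. The pattern provided must match the entire string.
The anchoring looked wrong most likely because, in the analyzed text field, the value had been split into separate tokens.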
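As a small illustration of the offset-and-length idea on a keyword field, the pattern and the search body can be generated from two numbers. This is only a sketch: `availability_pattern` and `search_body` are hypothetical helper names, the sample string is a shortened, made-up value in the same encoding, and `re.fullmatch` is used merely to imitate Lucene's anchored matching locally.

```python
import re


def availability_pattern(offset_days, nights):
    """Regexp for: skip offset_days characters, then require nights x 'A'."""
    return ".{%d}A{%d}.*" % (offset_days, nights)


def search_body(field, offset_days, nights):
    """Build the bool/filter/regexp request body used in this thread."""
    pattern = availability_pattern(offset_days, nights)
    return {
        "query": {
            "bool": {
                "filter": [{"regexp": {field: {"value": pattern}}}]
            }
        }
    }


# Hypothetical shortened value: 12 mixed days, 17 available days, then busy.
sample = "UIAOUUUUUUUI" + "A" * 17 + "OUUUU"

print(availability_pattern(12, 7))
print(re.fullmatch(availability_pattern(12, 7), sample) is not None)
print(re.fullmatch(availability_pattern(7, 7), sample) is not None)
print(search_body("Unit.DailyAvailablity", 12, 7))
```

Because the pattern uses an uppercase A, it only works against the unanalyzed keyword field, where the original casing is preserved.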
Hope that helps!
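Answer 1 (score: 1)
In addition to Nikolay Vasiliev's great and helpful answer: in my case I had to go one step further to make it work with NEST on .NET. I added an attribute mapping to DailyAvailability: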
[Keyword(Name = "DailyAvailability")]
public string DailyAvailability { get; set; }
The filter still did not work, and this is the mapping I got:
"DailyAvailability":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
My field contains about 732 characters, so it exceeded ignore_above and was not indexed in the keyword sub-field. I tried:
[Keyword(Name = "DailyAvailability", IgnoreAbove = 1024)]
public string DailyAvailability { get; set; }
It had no effect on the mapping. It only started working properly after I added the mapping manually:
var client = new ElasticClient(settings);
client.CreateIndex("vrp", c => c
.Mappings(ms => ms.Map<Unit>(m => m
.Properties(ps => ps
.Keyword(k => k.Name(u => u.DailyAvailability).IgnoreAbove(1024))
)
)
));
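The point is this:
ignore_above - Do not index any string longer than this value. Defaults to 2147483647 so that all values would be accepted. Please however note that default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.
So if you need to filter long keyword fields with regexp, use an explicit mapping that sets ignore_above for them.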
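The length check is easy to reason about with a tiny Python sketch. It only restates the rule quoted above and is not how Elasticsearch implements it; `is_indexed` and `keyword_mapping` are hypothetical helpers, and the 732-character value is a stand-in for the real field.

```python
def is_indexed(value, ignore_above):
    """Strings longer than ignore_above are skipped when indexing the field."""
    return len(value) <= ignore_above


def keyword_mapping(ignore_above):
    """Field mapping fragment for an explicit keyword field."""
    return {"type": "keyword", "ignore_above": ignore_above}


value = "A" * 732  # stand-in for a two-year availability string

print(is_indexed(value, 256))    # dynamic-mapping default for the sub-field
print(is_indexed(value, 1024))   # explicit limit used in the manual mapping
print(keyword_mapping(1024))
```

Any limit at or above the longest expected value works; 1024 is simply the number used in the manual mapping above.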
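Answer 2 (score: 0)
This may be useful to someone: the ES regexp engine does not support the \d and \w patterns. You should write them out as explicit character classes such as [0-9] and [a-z].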
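If patterns are written elsewhere with these shorthands, they can be rewritten before being sent. The following Python sketch is a naive helper under stated assumptions: `expand_shorthand` is a hypothetical name, \w is expanded to its ASCII equivalent [A-Za-z0-9_] (narrow it to [a-z] if that is all you need), and shorthands inside bracket expressions or after an escaped backslash are not handled.

```python
import re

# Assumed expansions; adjust the \w class to the characters you actually need.
SHORTHAND = {r"\d": "[0-9]", r"\w": "[A-Za-z0-9_]"}


def expand_shorthand(pattern):
    """Replace \\d and \\w with explicit character classes."""
    return re.sub(r"\\[dw]", lambda m: SHORTHAND[m.group(0)], pattern)


print(expand_shorthand(r"\d{4}-\w+"))
```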