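How to implement a PatternAnalyzer in elastic4s and Elasticsearch to exclude results by a field

Time: 2015-06-15 12:10:43

Tags: elasticsearch playframework elastic4s

I am trying to run a query against my index that returns all reviews whose reviewer does not have a Gravatar image. To do this, I implemented a PatternAnalyzerDefinition with a host pattern: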

"^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)"

It should match and extract the host of a URL, so that:

https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm

becomes:

www.gravatar.com
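As a quick sanity check outside Elasticsearch, the regex itself can be exercised with plain java.util.regex. This is only a minimal sketch: the object name HostPatternCheck and the helper extractHost are made up for illustration and are not part of elastic4s. It shows that capture group 1 does hold the host, so the regex is not the part that is broken.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, not part of elastic4s or Elasticsearch.
object HostPatternCheck {
  // Same regex as in the question; capture group 1 is the host part of the URL.
  val hostPattern: Pattern =
    Pattern.compile("^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)")

  // Returns the captured host, or None when the input does not start with http(s)://.
  def extractHost(url: String): Option[String] = {
    val m = hostPattern.matcher(url)
    if (m.find()) Some(m.group(1)) else None
  }

  def main(args: Array[String]): Unit = {
    // prints Some(www.gravatar.com)
    println(extractHost("https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm"))
  }
}
```

Whether an analyzer built from this regex emits that captured group as a token is a separate matter, which depends on how the analyzer uses the pattern.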

Mapping:

clientProvider.getClient.execute {
  create.index(_index).analysis(
    phraseAnalyzer,
    PatternAnalyzerDefinition("host_pattern", regex = "^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)")
  ).mappings(
    "reviews" as (
      .... Cool mappings
      "review" inner (
        "grade" typed LongType,
        "text" typed StringType index "not_analyzed",
        "reviewer" inner (
          "screenName" typed StringType index "not_analyzed",
          "profilePicture" typed StringType analyzer "host_pattern",
          "thumbPicture" typed StringType index "not_analyzed",
          "points" typed LongType index "not_analyzed"
        ),
        .... Other cool mappings
      )
    ) all(false)
  )
} map { response =>
  Logger.info("Create index response: {}", response)
} recover {
  case t: Throwable => play.Logger.error("Error creating index: ", t)
}

Query:

val reviewQuery = (search in path)
  .query(
    bool(
      must(
        not(
          termQuery("review.reviewer.profilePicture", "www.gravatar.com")
        )
      )
    )
  )
  .postFilter(
    bool(
      must(
        rangeFilter("review.grade") from 3
      )
    )
  )
  .size(size)
  .sort(by field "review.created" order SortOrder.DESC)

clientProvider.getClient.execute {
  reviewQuery
}.map(_.getHits.jsonToList[ReviewData])

Inspecting the mapping of the index shows:

reviewer: {
    properties: {
        id: {
            type: "long"
        },
        points: {
            type: "long"
        },
        profilePicture: {
            type: "string",
            analyzer: "host_pattern"
        },
        screenName: {
            type: "string",
            index: "not_analyzed"
        },
        state: {
            type: "string"
        },
        thumbPicture: {
            type: "string",
            index: "not_analyzed"
        }
    }
}

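When I run the query, the pattern matching does not seem to work: I still get reviews from reviewers who have a Gravatar image. What am I doing wrong? Maybe I have misunderstood the PatternAnalyzer?

I am using "com.sksamuel.elastic4s" %% "elastic4s" % "1.5.9",

1 Answer:

Answer 0: (score: 0)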

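I guess this is once again a case of RTFM:

The docs state:

IMPORTANT: The regular expression should match the token separators, not the tokens themselves.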

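This means that in my case the matched text, www.gravatar.com, is treated as a separator and will not be among the tokens after the field has been analyzed.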
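The effect can be approximated outside Elasticsearch by splitting the URL on the regex, which is roughly what a pattern-based analyzer does with its separator pattern. This is a rough sketch, not the actual Lucene implementation: the object and method names are hypothetical, and it assumes the analyzer drops empty tokens and lowercases the rest (lowercasing is the documented default for the pattern analyzer).

```scala
import java.util.regex.Pattern

// Hypothetical approximation of a separator-based pattern analyzer.
object SeparatorSplitCheck {
  val hostPattern: Pattern =
    Pattern.compile("^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)")

  // Regex matches are treated as separators and discarded;
  // whatever lies between them is lowercased and kept as a token.
  def tokensBetweenMatches(text: String): List[String] =
    hostPattern.split(text).toList.filter(_.nonEmpty).map(_.toLowerCase)

  def main(args: Array[String]): Unit = {
    // prints List(avatar/blablalbla?s=200&r=pg&d=mm)
    println(tokensBetweenMatches("https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm"))
  }
}
```

The host is consumed as part of the separator, so a term query for www.gravatar.com has nothing to match against, and the surrounding not clause excludes nothing.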

Instead, use the Pattern Capture Token Filter.

First, declare a new CustomAnalyzerDefinition:

val hostAnalyzer = CustomAnalyzerDefinition(
    "host_analyzer",
    StandardTokenizer,
    PatternCaptureTokenFilter(
      name = "hostFilter",
      patterns = List[String]("^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)"),
      preserveOriginal = false
    )
  )
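To make the difference from the pattern analyzer concrete, here is a simplified sketch of what a pattern capture filter does with a single incoming token: it emits the capture groups as tokens, and keeps the original token when nothing matches or when preserveOriginal is set. The names CaptureFilterSketch and captureTokens are hypothetical, and the real Lucene filter also tracks positions and offsets, which this sketch ignores. Note that the filter only ever sees the tokens the tokenizer hands it, so the first example assumes the whole URL arrives as one token.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Hypothetical, simplified model of a pattern capture token filter.
object CaptureFilterSketch {
  val hostPattern: Pattern =
    Pattern.compile("^https?\\:\\/\\/([^\\/?#]+)(?:[\\/?#]|$)")

  // Emits every non-empty capture group of every pattern match.
  // The original token is kept if nothing was captured or preserveOriginal is true.
  def captureTokens(token: String, patterns: Seq[Pattern], preserveOriginal: Boolean): List[String] = {
    val captured = ListBuffer.empty[String]
    for (p <- patterns) {
      val m = p.matcher(token)
      while (m.find()) {
        for (g <- 1 to m.groupCount) {
          val value = m.group(g)
          if (value != null && value.nonEmpty) captured += value
        }
      }
    }
    if (captured.isEmpty) List(token)
    else if (preserveOriginal) token :: captured.toList
    else captured.toList
  }

  def main(args: Array[String]): Unit = {
    // prints List(www.gravatar.com)
    println(captureTokens("https://www.gravatar.com/avatar/x", Seq(hostPattern), preserveOriginal = false))
    // prints List(avatar): a token the pattern does not match passes through unchanged
    println(captureTokens("avatar", Seq(hostPattern), preserveOriginal = false))
  }
}
```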

Then add the analyzer to the field:

"review" inner (
  "reviewer" inner (
    "screenName" typed StringType index "not_analyzed",
    // The analyzer name must match the name given to CustomAnalyzerDefinition above.
    "profilePicture" typed StringType analyzer "host_analyzer",
    "thumbPicture" typed StringType index "not_analyzed",
    "points" typed LongType index "not_analyzed"
  )
)

create.index(_index).analysis(
            someAnalyzer,
            phraseAnalyzer,
            hostAnalyzer
          ).mappings(

Voilà, it works. A very handy way to inspect the tokens and the index is to call:

/[index]/[collection]/[id]/_termvector?fields=review.reviewer.profilePicture&pretty=true