ElasticSearch document structure: avoid indexing https / http separately

Time: 2015-09-02 20:14:28

Tags: c# elasticsearch nest

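I have been building an ElasticSearch "web pages" index that will power search on a public website. I have written a C# class and decorated it with some Nest attributes, but I am still not sure I have covered everything I might need.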

Here is my class:

using Nest;

[ElasticType(IdProperty = "url_id")]
public class WebPage
{
    /// <summary>
    /// The last time this document was indexed
    /// </summary>
    public string dateScanned { get; set; }

    /// <summary>
    /// The ACTUAL mime type returned.  Can be something like application/vnd.openxmlformats-officedocument.presentationml.presentation
    /// </summary>
    [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] //so we can do aggregates / faceted search on it
    public string mimeType { get; set; }

    /// <summary>
    /// Human-friendly type.  Like: HTML, DOC, PPT
    /// </summary>
    [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] //so we can do aggregates / faceted search on it
    public string shortMimeType { get; set; }

    /// <summary>
    /// The URL without protocol.  Prevents indexing http:// and https:// as two separate index pages
    /// This is used as the ID field in ES.
    /// </summary>
    public string url_id { get; set; }

    /// <summary>
    /// The url we use when building a link.  DOES include protocol
    /// </summary>
    public string url { get; set; }

    //the rest are your standard fields for a simple "document"
    public string body { get; set; }
    public string keywords { get; set; }
    public string description { get; set; }
    public string title { get; set; }
}

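One problem I ran into is that if I use the full URL as the ElasticSearch ID, I can end up with two entries for the same page, i.e. one for the http:// version of the URL and one for the https:// version.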

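To prevent this, I decided to store the URL without its protocol in the "url_id" field (for example www.example.com/) and use that as the ES identifier. I also store the full URL in the "url" field, which is the URL printed on the page at query time.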
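The protocol stripping described above could be sketched roughly as follows. This is only a minimal illustration, not the code I actually run: `UrlIdHelper` and `ToUrlId` are hypothetical names, and the sketch assumes every input is an absolute URL that `System.Uri` can parse.

```csharp
using System;

public static class UrlIdHelper
{
    // Hypothetical helper: returns the URL without its scheme,
    // so the http and https variants of a page map to one ID.
    public static string ToUrlId(string url)
    {
        // Throws UriFormatException if the input is not an absolute URL.
        var uri = new Uri(url, UriKind.Absolute);

        // Keep an explicit port only when it is not the scheme default (80 / 443).
        string authority = uri.IsDefaultPort
            ? uri.Host
            : uri.Host + ":" + uri.Port;

        // PathAndQuery is "/" for a bare host, so the result always contains a path.
        return authority + uri.PathAndQuery;
    }
}
```

With this sketch, `UrlIdHelper.ToUrlId("http://www.example.com/")` and `UrlIdHelper.ToUrlId("https://www.example.com/")` both give `www.example.com/`, which would then be assigned to `url_id` before indexing.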

The problem I see is that the url field will sometimes point to http and other times to https: it will be whichever one was indexed last.
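One way to make the stored value deterministic instead of "last one wins" might be to pick the URL by an explicit rule before indexing. The sketch below assumes, purely for illustration, that https should be kept once it has been seen; `PreferredUrl.Choose` is a hypothetical helper, and it assumes the caller has already looked up the previously stored url (null or empty if the page is new).

```csharp
using System;

public static class PreferredUrl
{
    // Hypothetical rule: once an https URL has been stored, never
    // overwrite it with an http one; otherwise take the incoming URL.
    public static string Choose(string existingUrl, string incomingUrl)
    {
        // Nothing stored yet: the incoming URL is the only candidate.
        if (string.IsNullOrEmpty(existingUrl))
            return incomingUrl;

        bool existingIsHttps = existingUrl.StartsWith(
            "https://", StringComparison.OrdinalIgnoreCase);
        bool incomingIsHttps = incomingUrl.StartsWith(
            "https://", StringComparison.OrdinalIgnoreCase);

        // Keep the stored https URL when the new crawl only saw http.
        if (existingIsHttps && !incomingIsHttps)
            return existingUrl;

        return incomingUrl;
    }
}
```

The cost of this approach is an extra lookup of the existing document before each index operation, so it is a trade-off rather than an obvious fix.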

Is there a better way to handle the protocol storage problem?

0 answers:

No answers yet