Question

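I have been building an ElasticSearch "web page" index that will be used to power search on a public website. I have built a C# class and decorated it with some NEST attributes, but I am still not sure I have covered everything I might need.

Here is my class: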

[ElasticType(IdProperty = "url_id")]
public class WebPage
{
    /// <summary>
    /// The last time this document was indexed
    /// </summary>
    public string dateScanned { get; set; }

    /// <summary>
    /// The ACTUAL mime type returned.  Can be something like application/vnd.openxmlformats-officedocument.presentationml.presentation
    /// </summary>
    [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] //so we can do aggregates / faceted search on it
    public string mimeType { get; set; }

    /// <summary>
    /// Human-friendly type.  Like: HTML, DOC, PPT
    /// </summary>
    [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] //so we can do aggregates / faceted search on it
    public string shortMimeType { get; set; }

    /// <summary>
    /// The URL without protocol.  Prevents indexing http:// and https:// as two separate index pages
    /// This is used as the ID field in ES.
    /// </summary>
    public string url_id { get; set; }

    /// <summary>
    /// The url we use when building a link.  DOES include protocol
    /// </summary>
    public string url { get; set; }

    //the rest are your standard fields for a simple "document"
    public string body { get; set; }
    public string keywords { get; set; }
    public string description { get; set; }
    public string title { get; set; }
}

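One problem I ran into is that if I use the full URL as the ElasticSearch ID, I can end up with two entries for the same page, i.e. one for the http:// version of the address and one for the https:// version.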

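To prevent this, I decided to store the URL without its protocol in the "url_id" field (www.example.com/ in the example above) and use that as the ES identifier. I also store the full URL in the "url" field, which is the URL printed on the page at query time.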
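For illustration, the protocol-stripping step could look roughly like the sketch below. `UrlIdHelper` and `ToUrlId` are hypothetical names that are not part of my actual code, and the sketch assumes every input is an absolute http or https URL.

```csharp
using System;

// Hypothetical helper (not part of the original class): derives the
// protocol-less "url_id" value from a full URL.
public static class UrlIdHelper
{
    public static string ToUrlId(string url)
    {
        // Assumption: the input is an absolute URL; Uri throws otherwise.
        var uri = new Uri(url, UriKind.Absolute);

        // Keep a non-default port so that different ports stay distinct.
        string authority = uri.IsDefaultPort
            ? uri.Host
            : uri.Host + ":" + uri.Port;

        // Host is lower-cased by Uri; PathAndQuery is "/" for the site root.
        return authority + uri.PathAndQuery;
    }
}
```

With this sketch, both `UrlIdHelper.ToUrlId("http://www.example.com/")` and `UrlIdHelper.ToUrlId("https://www.example.com/")` return `www.example.com/`, so the two variants map to the same ES ID.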

The problem I see is that the "url" field will sometimes point to http and other times to https: it will be whichever one was indexed last.

Is there a better way to handle storing the protocol?

ElasticSearch document structure that does not index https / http separately

0 answers: