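I have been building an ElasticSearch "web pages" index that will power the search on a public website. I have a C# class that I decorated with a few NEST attributes, but I'm still not sure I've covered everything I might need.
Here is my class: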
    using Nest;

    [ElasticType(IdProperty = "url_id")]
    public class WebPage
    {
        /// <summary>
        /// The last time this document was indexed
        /// </summary>
        public string dateScanned { get; set; }

        /// <summary>
        /// The ACTUAL mime type returned. Can be something like application/vnd.openxmlformats-officedocument.presentationml.presentation
        /// </summary>
        [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] // so we can do aggregates / faceted search on it
        public string mimeType { get; set; }

        /// <summary>
        /// Human-friendly type. Like: HTML, DOC, PPT
        /// </summary>
        [ElasticProperty(Index = FieldIndexOption.NotAnalyzed)] // so we can do aggregates / faceted search on it
        public string shortMimeType { get; set; }

        /// <summary>
        /// The URL without protocol. Prevents indexing http:// and https:// as two separate index pages.
        /// This is used as the ID field in ES.
        /// </summary>
        public string url_id { get; set; }

        /// <summary>
        /// The URL we use when building a link. DOES include protocol
        /// </summary>
        public string url { get; set; }

        // the rest are your standard fields for a simple "document"
        public string body { get; set; }
        public string keywords { get; set; }
        public string description { get; set; }
        public string title { get; set; }
    }
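One problem I ran into is that if I use the full URL as the ElasticSearch ID, I can end up with two entries for the same page, i.e. one indexed under http:// and one under https://.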
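To prevent this, I decided to store the URL without the protocol in the "url_id" field (for example, www.example.com/) and use that as the ES identifier. I also store the full URL in the "url" field, which is the URL printed on the page at query time.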
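For illustration, here is a minimal sketch of how the "url_id" value could be derived from the full URL. The UrlIdHelper class and its ToUrlId method are hypothetical names used only for this example (they are not part of NEST), and the sketch assumes only http:// and https:// need to be stripped:

```csharp
using System;

public static class UrlIdHelper
{
    // Hypothetical helper, not part of NEST: removes a leading http:// or
    // https:// so both variants of a page produce the same document ID.
    public static string ToUrlId(string url)
    {
        if (string.IsNullOrWhiteSpace(url))
            throw new ArgumentException("url must not be empty", "url");

        var trimmed = url.Trim();

        // Only strip a scheme at the very start of the string, so a "://"
        // that appears later (e.g. in a query string) is left untouched.
        foreach (var scheme in new[] { "http://", "https://" })
        {
            if (trimmed.StartsWith(scheme, StringComparison.OrdinalIgnoreCase))
                return trimmed.Substring(scheme.Length);
        }

        // No recognised scheme: use the value as-is.
        return trimmed;
    }
}
```

With this, both UrlIdHelper.ToUrlId("http://www.example.com/") and UrlIdHelper.ToUrlId("https://www.example.com/") return "www.example.com/", so the two variants overwrite the same document instead of creating two.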
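The problem I see is that the url field will sometimes point to http and other times to https: it will be whichever one was indexed last.
Is there a better way to handle the protocol storage problem?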