Term frequency of a document with NEST (Elasticsearch)

Date: 2015-06-26 10:09:42

Tags: c# elasticsearch nest word-frequency

I am new to Elasticsearch and want to get the top N term frequencies of the "content" field of a specific document using NEST. I've searched a lot for an answer that works for me, but all I learned is that I should use term vectors rather than a terms facet, because the facet counts terms across the whole set of documents. I know that I need to configure term vectors on the field, as below:

[ElasticProperty(Type = Nest.FieldType.attachment, TermVector =Nest.TermVectorOption.with_positions_offsets, Store = true)]
    public Attachment File { get; set; }

I have searched a lot for how to get the term frequencies of a specific document using NEST, but everything I found was about Lucene and Solr. I need an example in NEST. I appreciate your help.


One more question: the solution suggested by Rob works well when I want the term frequencies of a plain string field, such as the title of my documents. But when I change the target field to the content of the documents, I get no results back. To be able to search the content of documents, I followed the answer in this link: ElasticSearch & attachment type (NEST C#). That works fine, and I can search for a term in the content of a document, but getting the term frequencies does not work. Below is the code for it:

var searchResults = client.TermVector<Document>(t =>t.Id(ID).TermStatistics().Fields(f => f.File));    

Does anyone have a solution for it?

1 Answer:

Answer 0 (score: 2)

You can do this with client.TermVector(..). Here is a simple example:

Document class:

public class MyDocument
{
    public int Id { get; set; } 
    [ElasticProperty(TermVector = TermVectorOption.WithPositionsOffsets)]
    public string Description { get; set; }
    [ElasticProperty(Type = FieldType.Attachment, TermVector =TermVectorOption.WithPositionsOffsetsPayloads, Store = true, Index = FieldIndexOption.Analyzed)]
    public Attachment File { get; set; }
}

Index some test data:

var indicesOperationResponse = client.CreateIndex(indexName, c => c
    .AddMapping<MyDocument>(m => m.MapFromAttributes()));

var myDocument = new MyDocument {Id = 1, Description = "test cat test"};
client.Index(myDocument);
client.Index(new MyDocument {Id = 2, Description = "river"});
client.Index(new MyDocument {Id = 3, Description = "test"});
client.Index(new MyDocument {Id = 4, Description = "river"});

client.Refresh();

Retrieve term statistics through NEST:

var termVectorResponse = client.TermVector<MyDocument>(t => t
    .Document(myDocument)
    //.Id(1) //you can specify document by id as well
    .TermStatistics()
    .Fields(f => f.Description));

foreach (var item in termVectorResponse.TermVectors)
{
    Console.WriteLine("Field: {0}", item.Key);

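    // Ranks by TotalTermFrequency (count across the index); order by
    // x.Value.TermFrequency instead to rank by count within this document.
    var topTerms = item.Value.Terms.OrderByDescending(x => x.Value.TotalTermFrequency).Take(10);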
    foreach (var term in topTerms)
    {
        Console.WriteLine("{0}: {1}", term.Key, term.Value.TermFrequency);
    }
}

Output:

Field: description
cat: 1
test: 2
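
Since the question asks for the top N terms within one document, note that the snippet above orders by TotalTermFrequency (occurrences across the index) while printing TermFrequency (occurrences in this document). The following is a minimal sketch of ranking by the in-document count instead. It does not call NEST: TermStats and TopTerms are hypothetical stand-ins for the entries in termVectorResponse.TermVectors, and the sample counts are illustrative values derived from the test documents above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one term's entry in a term vector response.
public class TermStats
{
    public int TermFrequency { get; set; }       // occurrences in this document
    public long TotalTermFrequency { get; set; } // occurrences across the index
}

public static class TopTerms
{
    // Returns the n terms that occur most often in the document itself.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<KeyValuePair<string, TermStats>> ByDocumentFrequency(
        IDictionary<string, TermStats> terms, int n)
    {
        return terms
            .OrderByDescending(t => t.Value.TermFrequency)
            .ThenBy(t => t.Key, StringComparer.Ordinal)
            .Take(n)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Illustrative counts for the document "test cat test".
        var terms = new Dictionary<string, TermStats>
        {
            { "cat",  new TermStats { TermFrequency = 1, TotalTermFrequency = 1 } },
            { "test", new TermStats { TermFrequency = 2, TotalTermFrequency = 3 } }
        };

        foreach (var term in TopTerms.ByDocumentFrequency(terms, 10))
        {
            Console.WriteLine("{0}: {1}", term.Key, term.Value.TermFrequency);
        }
    }
}
```

With real data, the same OrderByDescending(x => x.Value.TermFrequency) call can be applied directly to item.Value.Terms in the loop above.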

Hope it helps.

UPDATE

One thing is interesting when I check the mapping of the index:

{
    "my_index" : {
        "mappings" : {
            "mydocument" : {
                "properties" : {
                    "file" : {
                        "type" : "attachment",
                        "path" : "full",
                        "fields" : {
                            "file" : {
                                "type" : "string"
                            },
                            "author" : {
                                "type" : "string"
                            },
                            "title" : {
                                "type" : "string"
                            },
                            "name" : {
                                "type" : "string"
                            },
                            "date" : {
                                "type" : "date",
                                "format" : "dateOptionalTime"
                            },
                            "keywords" : {
                                "type" : "string"
                            },
                            "content_type" : {
                                "type" : "string"
                            },
                            "content_length" : {
                                "type" : "integer"
                            },
                            "language" : {
                                "type" : "string"
                            }
                        }
                    },
                    "id" : {
                        "type" : "integer"
                    }
                }
            }
        }
    }
}

There is no information about term vectors in it.

When I create the mapping through Sense:

PUT http://localhost:9200/my_index/mydocument/_mapping
{
  "mydocument": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "term_vector":"with_positions_offsets",
            "store": true
          }
        }
      }
    }
  }
}

I was able to retrieve term statistics.

Hopefully I'll come back later with a working mapping created through NEST.
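
To check the same thing outside NEST (for example from Sense or curl), one could request the term vector over REST. Below is a small sketch that only builds the request URL. It assumes the Elasticsearch 1.x endpoint name _termvector (later versions renamed it _termvectors) and its fields and term_statistics query parameters; TermVectorUrl is a hypothetical helper, and the index, type, id, and field names passed in are just the ones used in this example.

```csharp
using System;
using System.Collections.Generic;

public static class TermVectorUrl
{
    // Builds a GET URL for the term vector of a single document.
    // Assumes the Elasticsearch 1.x endpoint "_termvector".
    public static string Build(string baseUrl, string index, string type, string id,
        IEnumerable<string> fields, bool termStatistics)
    {
        var path = string.Join("/",
            baseUrl.TrimEnd('/'),
            Uri.EscapeDataString(index),
            Uri.EscapeDataString(type),
            Uri.EscapeDataString(id),
            "_termvector");

        // Multiple fields are sent as one comma-separated, URL-encoded value.
        var query = "fields=" + Uri.EscapeDataString(string.Join(",", fields))
                  + "&term_statistics=" + (termStatistics ? "true" : "false");

        return path + "?" + query;
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(TermVectorUrl.Build(
            "http://localhost:9200", "my_index", "mydocument", "1",
            new[] { "file" }, true));
    }
}
```

If the response for the attachment field comes back without terms, that would be consistent with the mapping not having term vectors enabled on that field.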

UPDATE2

Based on Greg's answer, try this fluent mapping:

var indicesOperationResponse = client.CreateIndex(indexName, c => c
        .AddMapping<MyDocument>(m => m
            .MapFromAttributes()
            .Properties(ps => ps
                .Attachment(s => s.Name(p => p.File)
                    .FileField(ff => ff.Name(f => f.File).TermVector(TermVectorOption.WithPositionsOffsets)))))
    );