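I need some help explaining what is going on in my ES setup. Basically, I have created multiple indices with custom analyzers (one per language we support), and mappings are created at index time for each client we have. The problem shows up at search time: when I search across all of a client's indices, one particular index (English) always ranks higher than the other languages, even when the search term occurs less often in the documents of the English index.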
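So here is what I have in my ES setup: we have multiple clients, and each client can upload documents in multiple languages. To meet this requirement, I set up indices named after the clientId and the language, i.e. A-en, A-de, A-fr, B-en, B-it, etc. (where A and B are client IDs and -xx is the ISO language code). Each index is created with a custom analyzer for the language that client needs, and each analyzed field is mapped to use the custom analyzer defined in the settings section, as shown below.
Here are the English index settings, into which every client's "English" documents are indexed: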
{
"settings" : {
"index" : {
"number_of_shards" : 5,
"number_of_replicas" : 1
},
"analysis" : {
"filter" : {
"english_keywords" : {
"type" : "keyword_marker",
"keywords" : ["_none_"]
},
"english_stop" : {
"type" : "stop",
"stopwords" : ["_none_"]
},
"synonym_filter" : {
"type" : "synonym",
"expand" : 1,
"synonyms" : ["_none_"]
},
"english_stemmer" : {
"type" : "stemmer",
"language" : "english"
}
},
"analyzer" : {
"lens-english" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["english_keywords", "lowercase", "english_stop", "english_stemmer", "synonym_filter"]
}
}
}
},
"mappings" : {
"video" : {
"properties" : {
"Attributes" : {
"type" : "string",
"index" : "not_analyzed"
},
"ClientId" : {
"type" : "string",
"index" : "not_analyzed"
},
"Comments" : {
"type" : "string",
"analyzer" : "lens-english"
},
"Continent" : {
"type" : "string",
"index" : "not_analyzed"
},
"CountryOfOrigin" : {
"type" : "string",
"index" : "not_analyzed"
},
"CreatedDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"Description" : {
"type" : "string",
"analyzer" : "lens-english"
},
"DescriptionEnglish" : {
"type" : "string",
"analyzer" : "english"
},
"DislikesCount" : {
"type" : "double"
},
"EnglishTranscription" : {
"type" : "string",
"analyzer" : "english"
},
"Favourite" : {
"type" : "string",
"index" : "not_analyzed"
},
"FromProject" : {
"type" : "boolean"
},
"IsSearchable" : {
"type" : "boolean"
},
"LanguageISOCode" : {
"type" : "string",
"index" : "not_analyzed"
},
"LanguageOfOrigin" : {
"type" : "string",
"index" : "not_analyzed"
},
"LikesCount" : {
"type" : "double"
},
"NativeTranscription" : {
"type" : "string",
"analyzer" : "lens-english"
},
"ObjectId" : {
"type" : "string",
"index" : "not_analyzed"
},
"Published" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"Recommendations" : {
"type" : "string",
"index" : "not_analyzed"
},
"Status" : {
"type" : "long"
},
"Tags" : {
"type" : "string",
"analyzer" : "lens-english"
},
"Title" : {
"type" : "string",
"analyzer" : "lens-english"
},
"TitleEnglish" : {
"type" : "string",
"analyzer" : "english"
},
"TranscriptionStatus" : {
"type" : "double"
},
"UploadSource" : {
"type" : "double"
},
"VideoImage" : {
"type" : "string",
"index" : "no"
},
"ViewCount" : {
"type" : "double"
},
"WatchLater" : {
"type" : "string",
"index" : "not_analyzed"
},
"ExternalMetadata" : {
"type" : "nested",
"properties" : {
"Filters" : {
"type" : "string",
"index" : "not_analyzed"
},
"ProjectId" : {
"type" : "string",
"index" : "not_analyzed"
},
"Roles" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
}
}
}
}
And here is the Turkish index, into which Turkish documents are indexed:
{
"settings" : {
"index" : {
"number_of_shards" : 5,
"number_of_replicas" : 1
},
"analysis" : {
"filter" : {
"turkish_stop" : {
"type" : "stop",
"stopwords" : "_turkish_"
},
"synonym_filter" : {
"type" : "synonym",
"synonyms" : ["_none_"]
},
"turkish_lowercase" : {
"type" : "lowercase",
"language" : "turkish"
},
"turkish_keywords" : {
"type" : "keyword_marker",
"keywords" : ["_none_"]
},
"turkish_stemmer" : {
"type" : "stemmer",
"language" : "turkish"
}
},
"analyzer" : {
"lens-turkish" : {
"tokenizer" : "standard",
"filter" : ["apostrophe", "turkish_lowercase", "turkish_stop", "turkish_keywords", "turkish_stemmer", "synonym_filter"]
},
"folding" : {
"filter" : ["lowercase", "asciifolding"],
"tokenizer" : "standard"
}
}
}
},
"mappings" : {
"video" : {
"properties" : {
"Attributes" : {
"type" : "string",
"index" : "not_analyzed"
},
"ClientId" : {
"type" : "string",
"index" : "not_analyzed"
},
"Comments" : {
"type" : "string",
"analyzer" : "lens-turkish"
},
"Continent" : {
"type" : "string",
"index" : "not_analyzed"
},
"CountryOfOrigin" : {
"type" : "string",
"index" : "not_analyzed"
},
"CreatedDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"Description" : {
"type" : "string",
"analyzer" : "lens-turkish"
},
"DescriptionEnglish" : {
"type" : "string",
"analyzer" : "english"
},
"DislikesCount" : {
"type" : "double"
},
"EnglishTranscription" : {
"type" : "string",
"analyzer" : "english"
},
"Favourite" : {
"type" : "string",
"index" : "not_analyzed"
},
"FromProject" : {
"type" : "boolean"
},
"IsSearchable" : {
"type" : "boolean"
},
"LanguageISOCode" : {
"type" : "string",
"index" : "not_analyzed"
},
"LanguageOfOrigin" : {
"type" : "string",
"index" : "not_analyzed"
},
"LikesCount" : {
"type" : "double"
},
"NativeTranscription" : {
"type" : "string",
"analyzer" : "lens-turkish"
},
"ObjectId" : {
"type" : "string",
"index" : "not_analyzed"
},
"Published" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"Recommendations" : {
"type" : "string",
"index" : "not_analyzed"
},
"Status" : {
"type" : "long"
},
"Tags" : {
"type" : "string",
"analyzer" : "lens-turkish"
},
"Title" : {
"type" : "string",
"analyzer" : "lens-turkish"
},
"TitleEnglish" : {
"type" : "string",
"analyzer" : "english"
},
"TranscriptionStatus" : {
"type" : "double"
},
"UploadSource" : {
"type" : "double"
},
"VideoImage" : {
"type" : "string",
"index" : "no"
},
"ViewCount" : {
"type" : "double"
},
"WatchLater" : {
"type" : "string",
"index" : "not_analyzed"
},
"ExternalMetadata" : {
"type" : "nested",
"properties" : {
"Filters" : {
"type" : "string",
"index" : "not_analyzed"
},
"ProjectId" : {
"type" : "string",
"index" : "not_analyzed"
},
"Roles" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
}
}
}
}
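All language indices follow this pattern (we have 24 different supported languages), and each client uses one of these settings both when creating an index and when indexing documents into it.
So far this all seems fine and ES is happy. Now we come to the search query, and this is where things get confusing.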
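My search query is based on the requirement that "phrases must take precedence over individual terms". In addition, when a client performs a search, it must run across all of that client's documents and languages (which is why the indices are created with the client ID in the name). This is achieved by using a wildcard in the index part of the URL, i.e. /A-*/video/_search will search all of client A's documents regardless of language.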
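As a minimal sketch of the naming scheme and wildcard search path described above (the helper names `index_name` and `search_path` are hypothetical and only for illustration, they are not part of our code base):

```python
def index_name(client_id: str, language_code: str) -> str:
    """Build the per-client, per-language index name, e.g. 'A-en'."""
    return f"{client_id}-{language_code}"


def search_path(client_id: str, doc_type: str = "video") -> str:
    """Build a search path covering every language index of one client.

    The '*' wildcard sits in the index part of the URL, so the request
    fans out to A-en, A-de, A-tr, ... for client 'A'.
    """
    return f"/{client_id}-*/{doc_type}/_search"


if __name__ == "__main__":
    print(index_name("A", "en"))
    print(search_path("A"))
```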
Here is the search query I send to the server:
POST /5617c3c867567a0b0c570a95-*/video/_search
{
"from": "0",
"size": "1000",
"query": {
"template": {
"query": {
"filtered": {
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "{{query_string}}",
"type": "most_fields",
"fields": [
"Title^3",
"Description^2",
"TitleEnglish",
"DescriptionEnglish",
"EnglishTranscription",
"NativeTranscription",
"Tags",
"Comments"
],
"tie_breaker": 0.1,
"minimum_should_match": "70%"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"IsSearchable": true
}
},
{
"term": {
"Private": false
}
}
]
}
}
}
},
"params": {
"query_string": "Turkish"
}
}
}
}
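Note that I am searching for the word "Turkish" and that the search runs across all languages. Now look at the results and notice that documents from the *-en index rank higher than those from the *-tr (Turkish) index, whose documents contain the word "Turkish" throughout their fields.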
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 15,
"successful": 15,
"failed": 0
},
"hits": {
"total": 7,
"max_score": 0.21282451,
"hits": [
{
"_index": "5617c3c867567a0b0c570a95-en",
"_type": "video",
"_id": "561bd2b274cbe0123c099ace",
"_score": 0.21282451,
"_source": {
"CountryOfOrigin": "United Kingdom",
"Continent": "Europe",
"LanguageOfOrigin": "English",
"LanguageIsoCode": "en",
"Title": "Nikes",
"TitleEnglish": "Eng video Eng lang",
"Description": "izlemek",
"DescriptionEnglish": "",
"VideoImage": "ff3a093a-700e-4c53-94df-cc5eb425c043_Image.jpg",
"ViewCount": 9,
"LikesCount": 0,
"DislikesCount": 0,
"CreatedDate": "2015-10-12T15:33:05.634Z",
"WatchLater": [],
"Favourite": [],
"Status": 2,
"TranscriptionStatus": 6,
"UploadSource": 3,
"IsSearchable": true,
"FromProject": false,
"NativeTranscription": "",
"Tags": [
"Turkish",
"Nike"
],
"Comments": [],
"Attributes": [],
"Recommendations": [],
"ClientId": "5617c3c867567a0b0c570a95",
"Private": false,
"ObjectId": "561bd2b274cbe0123c099ace"
}
},
{
"_index": "5617c3c867567a0b0c570a95-en",
"_type": "video",
"_id": "5617cb8b74cbe2110890820b",
"_score": 0.19917427,
"_source": {
"CountryOfOrigin": "Armenia",
"Continent": "Europe",
"LanguageOfOrigin": "English",
"LanguageIsoCode": "en",
"Title": "English Video",
"TitleEnglish": "English Video",
"DescriptionEnglish": "",
"VideoImage": "df80412b-d6b9-4104-932b-c8e44b005fb2_Image.jpg",
"ViewCount": 16,
"LikesCount": 1,
"DislikesCount": 0,
"CreatedDate": "2015-10-09T14:13:30.893Z",
"WatchLater": [],
"Favourite": [],
"Status": 2,
"TranscriptionStatus": 5,
"UploadSource": 3,
"IsSearchable": true,
"FromProject": false,
"NativeTranscription": "",
"Tags": [
"Turkish",
"Purple Aki"
],
"Comments": [],
"Attributes": [],
"Recommendations": [],
"ClientId": "5617c3c867567a0b0c570a95",
"Private": false,
"ObjectId": "5617cb8b74cbe2110890820b"
}
},
{
"_index": "5617c3c867567a0b0c570a95-en",
"_type": "video",
"_id": "561bb49e74cbe002f09301fa",
"_score": 0.17025961,
"_source": {
"CountryOfOrigin": "United Kingdom",
"Continent": "Europe",
"LanguageOfOrigin": "English",
"LanguageIsoCode": "en",
"Title": "Mark's Transcription Test",
"TitleEnglish": "Mark's Transcription Test",
"DescriptionEnglish": "",
"VideoImage": "09c6d366-6807-4d9d-9588-fd4730907b9b_Image.jpg",
"ViewCount": 6,
"LikesCount": 0,
"DislikesCount": 0,
"CreatedDate": "2015-10-12T13:24:45.833Z",
"WatchLater": [],
"Favourite": [],
"Status": 2,
"TranscriptionStatus": 6,
"UploadSource": 3,
"IsSearchable": true,
"FromProject": false,
"NativeTranscription": "",
"Tags": [
"turkish",
"mark",
"Watch"
],
"Comments": [],
"Attributes": [],
"Recommendations": [],
"ClientId": "5617c3c867567a0b0c570a95",
"Private": false,
"ObjectId": "561bb49e74cbe002f09301fa"
}
},
{
"_index": "5617c3c867567a0b0c570a95-tr",
"_type": "video",
"_id": "5617c97c74cbe21108908205",
"_score": 0.12725623,
"_source": {
"CountryOfOrigin": "Turkey",
"Continent": "Asia",
"LanguageOfOrigin": "Turkish",
"LanguageIsoCode": "tr",
"Title": "Turkish Video - Under 10mins - Request Trans",
"TitleEnglish": "Turkish Video - Under 10mins - Request Trans",
"Description": "Turkish - Request Trans",
"DescriptionEnglish": "Turkish - Request Trans",
"VideoImage": "ba4341e5-7af8-418e-91e3-818e290a0989_Image.jpg",
"ViewCount": 21,
"LikesCount": 0,
"DislikesCount": 0,
"CreatedDate": "2015-10-09T14:04:44.033Z",
"WatchLater": [],
"Favourite": [],
"Status": 2,
"TranscriptionStatus": 5,
"UploadSource": 3,
"IsSearchable": true,
"FromProject": false,
"NativeTranscription": "",
"Tags": [],
"Comments": [
"Turkish",
"Liverpool"
],
"Attributes": [
"5617c80974cbe211089081fd_3_2",
"5617c80974cbe211089081fe_4_1"
],
"Recommendations": [],
"ClientId": "5617c3c867567a0b0c570a95",
"Private": false,
"ObjectId": "5617c97c74cbe21108908205"
}
},
{
"_index": "5617c3c867567a0b0c570a95-tr",
"_type": "video",
"_id": "5617ca3574cbe21108908208",
"_score": 0.07719648,
"_source": {
"CountryOfOrigin": "Argentina",
"Continent": "South America",
"LanguageOfOrigin": "Turkish",
"LanguageIsoCode": "tr",
"Title": "Turkish Video - No Trans",
"TitleEnglish": "Turkish Video - No Trans",
"DescriptionEnglish": "",
"VideoImage": "735f0c09-3c1c-415e-870f-70f18be632ea_Image.jpg",
"ViewCount": 14,
"LikesCount": 0,
"DislikesCount": 0,
"CreatedDate": "2015-10-09T14:07:49.705Z",
"WatchLater": [],
"Favourite": [],
"Status": 2,
"TranscriptionStatus": 0,
"UploadSource": 3,
"IsSearchable": true,
"FromProject": false,
"NativeTranscription": "",
"Tags": [
"Turkish"
],
"Comments": [],
"Attributes": [],
"Recommendations": [],
"ClientId": "5617c3c867567a0b0c570a95",
"Private": false,
"ObjectId": "5617ca3574cbe21108908208"
}
},
{
"_index": "5617c3c867567a0b0c570a95-de",
"_type": "video",
"_id": "5617c8ca74cbe211089081ff",
"_score": 0.015614418,
"_source": {
"CountryOfOrigin": "Germany",
"Continent": "Europe",
"LanguageOfOrigin": "German",
"LanguageIsoCode": "de",
"Title": "German Video - Under 10mins - With SRT",
"TitleEnglish": "German Video - Under 10mins - With SRT",
"Description": "German Video\nTag: Oct 9",
"DescriptionEnglish": "German Video\nTag: Oct 9",
"VideoImage": "04bf4827-3459-41f6-9fc0-7003dfe7ea5d_Image.jpg",
"ViewCount": 5,
"LikesCount": 0,
"DislikesCount": 0,
"Published": "2015-10-09T14:03:01.066Z",
"CreatedDate": "2015-10-09T14:01:46.517Z",
"WatchLater": [],
"Favourite": [],
"Status": 2,
"TranscriptionStatus": 5,
"UploadSource": 3,
"IsSearchable": true,
"FromProject": false,
"NativeTranscription": "Ich denke, dass Nachhaltigkeit sich darum dreht,Verpackungen zu reduzieren oder Energie, die bei der Produktion entsteht,zu verringern oder auch lokal zu produzieren,um die CO2-Bilanz zu reduzieren.Ich glaube, dass sich viele Verbraucherbeim Einkaufen über Nachhaltigkeit Gedanken machen,was letztendlich auch beeinflusst was sie kaufen,vor allem aber würde ich von mir als Verbraucherin behaupten,dass ich mich an die Firmen halte, die die gleichen Wertebezüglich Nachhaltigkeit haben wie ich.Ich gehe gezielt in Geschäfte, die weniger Verpackung benutzenoder solche, die man einfacher recyclen kannund wenn wir können, gehen wir immer zu Fuß zu regionalenoder lokalen Geschäften, wenn sie in der Nähe sind.Und viele Unternehmen versuchen die gleichen Produktefür einen niedrigeren Preis zu verkaufen,aber wenn eine Firma mich überzeugen kann, dass ihre Produkte nachhaltiger sindoder sicherer für mich und meine Umwelt,wäre ich am Ende auch bereit, mehr zu bezahlen.Wenn ein Unternehmen behauptet, nachhaltig zu sein,will ich immer herausfinden auf welche Art und Weisesie sicherer sind.Es gibt so viele Öko-Zertifikateund ich weiß nicht was die bedeutenoder ob sie wirklich für Nachhaltigkeit stehen.Vielleicht könnte es einen Beschluss geben,der es den Verbrauchern einfacher macht,nachhaltige Produkte zu verstehen, das wäre für alle eine große Hilfe.",
"EnglishTranscription": "I think that sustainability turns about, Packaging to reduce or energy generated in the production, to reduce or even locally to produce, to reduce the CO2 footprint. I think that to many consumers worry buy about sustainability, What ultimately affects what you buy but above all, I would argue by me as a consumer, that I the companies consider myself, the same values as I have with regard to sustainability. I'm specifically going to shops that use less packaging or such which is easier to recycle can and if we can, we go to regional always walking or local shops if they are nearby. And many companies are trying the same products for sale, for a lower price But if a company can convince me that their products are more sustainable or safe for me and my environment. would I also be willing to pay more at the end. If a company claims to be sustainable. will I always find out in what way they are safer. There are so many eco-certificates and I don't know what you mean or whether they really are for sustainability. Perhaps there could be a decision, Consumers easier makes it,. understanding sustainable products that would be a great help for everyone.",
"Tags": [
"Oct 9",
"Turkish"
],
"Comments": [],
"Attributes": [
"5617c80974cbe211089081fd_3_2",
"5617c80974cbe211089081fe_4_4"
],
"Recommendations": [],
"ClientId": "5617c3c867567a0b0c570a95",
"Private": false,
"ObjectId": "5617c8ca74cbe211089081ff"
}
},
{
"_index": "5617c3c867567a0b0c570a95-tr",
"_type": "video",
"_id": "561b860d74cbe0103cf23369",
"_score": 0.011710813,
"_source": {
"CountryOfOrigin": "Turkey",
"Continent": "Asia",
"LanguageOfOrigin": "Turkish",
"LanguageIsoCode": "tr",
"Title": "izlemek Nike",
"TitleEnglish": "Demo 4",
"Description": "izlemek Nike",
"DescriptionEnglish": "Demo 4",
"VideoImage": "97e66fe2-6f62-4a43-b234-0abda414dedf_Image.jpg",
"ViewCount": 17,
"LikesCount": 0,
"DislikesCount": 0,
"Published": "2015-10-12T10:07:52.281Z",
"CreatedDate": "2015-10-12T10:06:05.015Z",
"WatchLater": [],
"Favourite": [],
"Status": 2,
"TranscriptionStatus": 5,
"UploadSource": 3,
"IsSearchable": true,
"FromProject": false,
"NativeTranscription": "Şimdi makyaj masamın başına geçtimVe makyajımı yapmaya başlayacağımÖncelikle güzel bir baz süreceğimSmashbox'ın Photo Finish bazını kullanacağımÖnce göz makyajımı yapacağımBugün böyle altın ve siyah tonlarındaya da altın kahve tonlarında bir makyaj yapmayı planlıyorumÇünkü, giyeceğim bir ceket varCeket de altın zincirler ve altın detaylar taşıyorEe tabii, söz konusu altın olduğu zamanAltın ve bronz ve doğal tonlar olduğu zamanNaked paletimden elimi çekemiyorumEe tabii far kullanacaksam, bir far bazı kullanmadan olmazUrban Decay far kullanacağım içintesadüfen Urban Decay'den primer potion göz bazını kullanacağımŞu kadar miktar benim için yeterliBeni biraz böyle nefes nefese vehani koşturur vaziyette görebilirsinizÇünkü birazcık acelem varVe hazır böyle güzel bir saç makyaj gibi bir şey planlıyorlenNeden videosunu çekmeyeyim, diye düşündüm",
"EnglishTranscription": "Now I take over my dressing table And I'm going to start doing my makeup First of all, I'm going to drive a beautiful base Smashbox's Photo Finish base to use First, I'm going to do my eye makeup Today in shades of gold and black or I'm planning to do a makeup in shades of gold and coffee I'm going to wear a coat, because there Jacket in gold chains and gold carries the details So of course, when it comes to gold When gold and bronze and natural hues I can't get my hand off my naked palette So of course I use a headlight headlights not without some Urban Decay eyeshadow I use for Incidentally, I'm going to use from the Urban Decay primer potion eye base This quantity is enough for me That's me a little breathless and you know, the one you can see running condition Because it's a little bit of a hurry And such a beautiful something like hair make-up ready planned yorlen Why is the video I thought, that I may not",
"Tags": [
"test tag",
"turkish",
"mark",
"izlemek",
"Purple Aki"
],
"Comments": [],
"Attributes": [
"5617c80974cbe211089081fd_3_1"
],
"Recommendations": [],
"ClientId": "5617c3c867567a0b0c570a95",
"Private": false,
"ObjectId": "561b860d74cbe0103cf23369"
}
}
]
}
}
Can anyone who knows what to look for shed some light on this and spot what I have got wrong here?
Answer 0 (score: 0)
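Term frequency is only one part of the relevance calculation; inverse document frequency and document length matter too. In your example the English documents rank higher because 1) they are shorter, and 2) the English index contains fewer occurrences of the term "Turkish", so each document that does contain the term ranks higher.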
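To make the interaction of these three factors concrete, here is a rough sketch of the per-field weight used by Lucene's classic TF/IDF similarity (tf × idf × field-length norm). It is a simplification: query normalization, coordination, boosts, and the lossy encoding of norms are left out, and all the numbers in the usage part are invented for illustration, not taken from your indices.

```python
import math


def tf(term_freq: int) -> float:
    """Term frequency factor: grows with the square root of the count."""
    return math.sqrt(term_freq)


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(field_length: int) -> float:
    """Length normalization: shorter fields get a larger weight."""
    return 1.0 / math.sqrt(field_length)


def field_weight(term_freq: int, field_length: int,
                 doc_freq: int, num_docs: int) -> float:
    """Simplified weight of one term in one field of one document."""
    return tf(term_freq) * idf(doc_freq, num_docs) * field_norm(field_length)


if __name__ == "__main__":
    # Hypothetical statistics: a short field where the term is rare in its
    # index, versus a longer field where the term is common in its index.
    short_rare = field_weight(term_freq=1, field_length=2,
                              doc_freq=2, num_docs=100)
    long_common = field_weight(term_freq=2, field_length=8,
                               doc_freq=60, num_docs=100)
    print(short_rare > long_common)
```

With these made-up inputs the document containing the term only once still wins, because the term is rarer where it lives and its field is shorter. Running your query with `"explain": true` should show the actual tf, idf and fieldNorm values for each hit, so you can check which factor dominates in your data.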