In one of the fields of my Elasticsearch index, I store the URL of each document (e.g. http://techcrunch.com/something-great).
When I don't escape the URL, the documents are found correctly, but I get an EOF error on some URLs.
When I escape the URL with the following:
String escapedString = QueryParser.escape(e.getKey().getUrl());
no documents are found; I get zero hits.
So what should I do?
{
_index: "crawlbot",
_type: "article",
_id: "AVFaaFu4w49jUzVInKS5",
_score: 1,
_source: {
job: {
id: 65,
name: "wikipedia_en",
max_pages: 300000,
crawl_depth: 0,
processing_patterns: "-Category,-User,-Wikipedia:,-Topic,-Special:,-Talk:,-Portal:,-MOS",
status: 0,
days: 0,
url: [
"https://en.wikipedia.org"
],
ajax: false,
min_description: 0
},
article: {
url: "https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Parliament_of_Romania",
provider_url: "https://en.wikipedia.org",
provider_name: "",
provider_display: "en.wikipedia.org",
favicon_url: "http://www.google.com/s2/u/0/favicons?domain=https://en.wikipedia.org",
language: "en",
metadata: {
authors: []
},
entities: [],
keywords: [],
videos: [],
unfilteredKeywords: [],
published: "",
published_long: 0
}
}
}
I want to retrieve the document by its article.url.
This is the query:
SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
String escapedString = QueryParser.escape(e.getKey().getUrl());
queryBuilder.must(QueryBuilders.queryStringQuery(escapedString).defaultField("article.url"));
queryBuilder.must(QueryBuilders.queryStringQuery(e.getKey().getJob().getId() + "").defaultField("job.id"));
The error if I don't escape:
Exception in thread "main" org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], all shards failed; shardFailures {[9_T8APppReyWKppSNZWmXw][crawlbot][0]: SearchParseException[[crawlbot][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][1]: SearchParseException[[crawlbot][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][2]: SearchParseException[[crawlbot][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][3]: SearchParseException[[crawlbot][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][4]: SearchParseException[[crawlbot][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Answer 0 (score: 2)
I suggest changing the mapping of the article.url field to:
url: {
"type": "string",
"index": "not_analyzed"
}
If you don't, the field is analyzed: the standard analyzer breaks the URL into multiple tokens, which makes it hard to query.
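To illustrate the point about tokens, here is a minimal sketch in plain Java. `roughTokens` is a hypothetical helper that splits on anything that is not a letter or digit; it is only a rough stand-in for the standard analyzer, which follows Unicode word-break rules and will differ in detail. The idea it shows is that an analyzed field stores fragments, so the whole URL is never one of the indexed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Rough stand-in for an analyzed string field; NOT the real standard analyzer. */
final class AnalyzedUrlSketch {

    /** Splits on anything that is not a letter or digit and lowercases each piece. */
    static List<String> roughTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://techcrunch.com/something-great";
        List<String> indexed = roughTokens(url);
        // prints [http, techcrunch, com, something, great]
        System.out.println(indexed);
        // An exact lookup compares the whole URL against individual terms: prints false
        System.out.println(indexed.contains(url));
    }
}
```

With "index": "not_analyzed" the field would instead hold the URL as a single, unmodified term.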
Then you can query your documents with a term query instead of a query_string query.
SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
// use a term query instead of a query_string query
queryBuilder.must(QueryBuilders.termQuery("article.url", e.getKey().getUrl()));
...
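As background on why query_string is awkward for raw URLs, here is a small sketch. One plausible reading of the EOF error is that the Lucene query syntax treats `/` as a regular-expression delimiter, so a URL with an odd number of slashes leaves a regex unterminated. The `escape` method below is modelled on what Lucene's `QueryParser.escape` does (the character list is an assumption and may vary between versions), and `countSlashes` is a hypothetical helper added only for illustration.

```java
/** Sketch of query-string escaping; modelled on Lucene's QueryParser.escape, not a replacement for it. */
final class QueryStringEscapeSketch {

    // Assumption: the characters treated as query syntax; check the Lucene version you use.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every assumed special character with a backslash. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Counts '/' characters; an odd count would leave a regex delimiter unpaired. */
    static int countSlashes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '/') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String failing = "http://www.zeit.de/wirtschaft/2015-11/"
                + "griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4";
        String working = "https://en.wikipedia.org/w/index.php"
                + "?action=history&feed=atom&title=Parliament_of_Romania";
        System.out.println(countSlashes(failing)); // prints 5
        System.out.println(countSlashes(working)); // prints 4
        System.out.println(escape(failing));
    }
}
```

A term query sidesteps all of this because its value is never parsed as query syntax, so no escaping is needed.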
UPDATE
Following up on Evaldas' comment (kudos Evaldas!), a final idea is to create a custom analyzer that also makes sure the URL is lowercased.
When creating the index, you can add a new analyzer in settings and then use it as the analyzer for the article.url field:
PUT /crawlbot
{
"settings": {
"analysis": {
"analyzer": {
"url_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "lowercase" ]
}
}
}
},
"mappings": {
"article": {
"properties": {
"article": {
"properties": {
"url": {
"type": "string",
"analyzer": "url_analyzer"
}
}
}
}
}
}
}
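One consequence worth sketching: a keyword tokenizer plus a lowercase filter stores the whole URL as a single lowercased term, and term queries are not analyzed, so the value passed to a term query would need the same lowercasing on the client side. The `analyze` method below is a hypothetical stand-in for that analyzer chain, written in plain Java under that assumption.

```java
import java.util.Locale;

/** Stand-in for a keyword tokenizer followed by a lowercase filter. */
final class UrlAnalyzerSketch {

    /** The whole input stays one token; only its case changes. */
    static String analyze(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String url = "https://en.wikipedia.org/w/index.php"
                + "?action=history&feed=atom&title=Parliament_of_Romania";
        String indexedTerm = analyze(url);

        // The raw URL contains upper-case letters, so it differs from the stored term.
        System.out.println(indexedTerm.equals(url));          // prints false
        // Lowercasing the lookup value the same way makes the comparison succeed.
        System.out.println(indexedTerm.equals(analyze(url))); // prints true
    }
}
```

In other words, with this analyzer the lookup value for a term query would be the lowercased URL. A match query, which applies the field's analyzer to its input, would be another option.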