Elasticsearch and postfix log analysis

Time: 2016-01-19 14:07:13

Tags: php elasticsearch

We have an Elasticsearch setup that monitors postfix mail logs. Every month we generate statistics from these postfix logs, by running remote queries from PHP against the logstash indices generated over the last 30 days.

The PHP script searches, for a given sasl_username or a given IP (depending on the authentication type), all queue_ids and stores them in an array split into chunks of 1024 values (where 1024 is the maximum clause count configured in Elasticsearch).
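(For reference: the 1024 limit corresponds to Lucene's default maximum clause count. In Elasticsearch it can be raised node-wide in elasticsearch.yml, though a very high value mostly trades the error for slower queries. A sketch, where 4096 is only an example value:)

    # elasticsearch.yml -- node-level setting, needs a restart
    indices.query.bool.max_clause_count: 4096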

The script then loops over the queue_id array and, for each chunk, builds a string of queue_ids joined with OR, and sends a query filtered on that string.

The following code should make this clearer:

$url_count = "http://xxxxxxxxx:9200/logstash-*/_search?search_type=count";
$url_search = "http://xxxxxxxxx:9200/logstash-*/_search?pretty";

$query_count_sasl_username = '{"query": {
                                "filtered": {
                                    "query": {
                                        "range" : { 
                                            "@timestamp" : { 
                                                "gte" : "now-30d/d", 
                                                "lt" : "now/d" 
                                            }
                                        }
                                    },
                                    "filter":{
                                        "bool":{
                                            "must":[{
                                                "query":{
                                                    "match":{
                                                        "sasl_username":{
                                                            "query": "' . $username . '","type":"phrase"
                                                        }
                                                    }
                                                }
                                            }]
                                        }
                                    }
                                }
                            }}';

    $res = exec_curl($url_count, $query_count_sasl_username);

    $n_totale_records = json_decode($res)->hits->total;

    $query_get_all_sasl_username = ' {"fields" : ["queue_id"],"query": {"filtered": {"query": {"range" : { "@timestamp" : { "gte" : "now-30d/d", "lt" : "now/d" }}},
    "filter":{"bool":{"must":[{"query":{"match":{"sasl_username":{"query":"' . $username . '","type":"phrase"}}}}]}}}}, 
    "size": ' .  $n_totale_records . '}';

    $all_sasl_username = exec_curl($url_search, $query_get_all_sasl_username);
    //returns all hits for the searched sasl_username
    $array_result = json_decode($all_sasl_username);
    $array_domains = array();
    $array_from = array();

    $ids = array();
    $ids_str_tmp = "";
    $tmp_index_count = 0;
    $MAX_CLAUSE_COUNT = 1024; // must match the max clause count configured in Elasticsearch

    for($i = 0; $i < count($array_result->hits->hits); $i++){       
        $ids_str_tmp .=  "\\\"" . $array_result->hits->hits[$i]->fields->queue_id[0] ."\\\"";

        if($tmp_index_count != $MAX_CLAUSE_COUNT - 1 && $i != count($array_result->hits->hits) - 1){
            $ids_str_tmp .= " OR ";
        }

        if($tmp_index_count == $MAX_CLAUSE_COUNT - 1 || $i == count($array_result->hits->hits) - 1){
            array_push($ids, $ids_str_tmp);
            $ids_str_tmp = "";
            $tmp_index_count = 0;
        }
        else{
            $tmp_index_count++;
        }
    }

    //now I have all JSON nodes with the searched queue_ids, divided into blocks of 1024
    for($i = 0; $i < count($ids); $i++){
        $query_get_by_ids_count = '{"query":{
                                        "filtered": {
                                            "query": {
                                              "range" : { 
                                                "@timestamp" : { 
                                                  "gte" : "now-30d/d", "lt" : "now/d"
                                                }
                                              }
                                            },"filter":{
                                                "query_string":{
                                                    "query":"queue_id: ' . $ids[$i] . '"
                                                }
                                            }
                                        }
                                    }}';

        $res = exec_curl($url_count, $query_get_by_ids_count);

        $size_elements = json_decode($res)->hits->total;

        $query_get_by_ids = '{"query":{
                                "filtered": {
                                    "query": {
                                        "range" : {
                                            "@timestamp" : { 
                                                "gte" : "now-30d/d", "lt" : "now/d"
                                            }
                                        }
                                    },
                                    "filter":{
                                        "query_string":{
                                            "query":"queue_id: ' . $ids[$i] . '"}
                                        }
                                    }
                                }, 
                                "size": ' .  $size_elements . 
                            '}';

        $all_by_queue_id = exec_curl($url_search, $query_get_by_ids);

        $array_all_by_queue_id = json_decode($all_by_queue_id);
        //check every hit for the field "from" and save it in an array
        for($j = 0; $j < count($array_all_by_queue_id->hits->hits); $j++){
            if(property_exists($array_all_by_queue_id->hits->hits[$j]->_source, "from")){
                array_push($array_from,$array_all_by_queue_id->hits->hits[$j]->_source->from);
                array_push($array_domains,  substr($array_all_by_queue_id->hits->hits[$j]->_source->from, strrpos($array_all_by_queue_id->hits->hits[$j]->_source->from, "@") + 1));
            }
        }
    }
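A side note on the pattern above: every chunk costs two round-trips, one count query just to learn the total, then a second search with that total as the size. In Elasticsearch 1.x the scan/scroll API returns all matches in pages without knowing the total up front, which would remove the count step. A sketch of the initial request (1m and 500 are example values; the sasl_username filter would be added as in the queries above):

    POST /logstash-*/_search?search_type=scan&scroll=1m&size=500
    {
        "fields": ["queue_id"],
        "query": {
            "range": { "@timestamp": { "gte": "now-30d/d", "lt": "now/d" } }
        }
    }

Each follow-up call to _search/scroll with the returned _scroll_id then fetches the next page.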

The problem is the performance of this script: for a user with 1670 distinct postfix log entries (queue_ids) in a single day, it takes about 5 seconds. I think the problem is that the OR blocks degrade performance, because the initial queries respond quickly.
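One thing that may be worth testing: a terms filter accepts the ids as a plain JSON array, so Elasticsearch does not have to parse a long query_string with hundreds of OR clauses. A sketch of the per-chunk query rewritten this way (QID1/QID2 are placeholder queue_ids; this assumes queue_id is indexed not_analyzed):

    {"query": {"filtered": {
        "query": { "range": { "@timestamp": { "gte": "now-30d/d", "lt": "now/d" } } },
        "filter": { "terms": { "queue_id": ["QID1", "QID2"] } }
    }}}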

Is there a way to improve the performance, or another way of generating statistics on postfix logs with Elasticsearch?
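For the statistics themselves, a terms aggregation might avoid fetching every document into PHP at all, letting Elasticsearch count senders server-side. A sketch, assuming the from field is mapped not_analyzed (otherwise the domains would need a dedicated sub-field):

    {"size": 0,
     "query": { "range": { "@timestamp": { "gte": "now-30d/d", "lt": "now/d" } } },
     "aggs": { "senders": { "terms": { "field": "from", "size": 100 } } }
    }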

0 Answers:

No answers