ElasticSearch: Converting old visitor data into an index

Time: 2015-09-04 22:30:53

Tags: python-2.7 elasticsearch kibana-4

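I'm learning ElasticSearch and want to dump my business data into ES and view it with Kibana. After a week of assorted problems, I finally have ES and Kibana (1.7.0 and 4, respectively) working on two Ubuntu 14.04 desktops (clustered).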

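The problem I have now is how best to get the data into ES. The data flow is as follows: for each visit, I capture the PHP globals $_REQUEST and $_SERVER in a text file named with a unique ID. From there, if the visitor fills in a form, I capture that data in a text file that is also named with the unique ID, in a different directory. Then my clients tell me whether that form fill was any good, with a delay of up to 50 days.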

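So I'm starting with the visitor data, $_REQUEST and $_SERVER. Much of it is redundant, so I really only want to capture the timestamp of the visitor's arrival, their IP, the IP of the server they visited, the domain they visited, the unique ID, and their user agent. So I created this mapping: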

time_date_mapping = { 'type': 'date_time' }
str_not_analyzed = { 'type': 'string'} # Originally this included 'index': 'not_analyzed' as well

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping, # Stored as a Unix timestamp
        'Request Time': time_date_mapping, # Stored as a Unix timestamp
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
    },
}

Then I feed it into ES:

es.index(
    index=Visit_to_ElasticSearch.INDEX,
    doc_type=Visit_to_ElasticSearch.DOC_TYPE,
    id=self.uniqID,
    timestamp=int(math.floor(self._visit['Entrance Time'])),
    body=visit
)

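When I look at the data in ES, only Entrance Time, _id, _type, domain, and uniqID are indexed for searching (according to Kibana). All of the data is present in the document, but most fields show "Unindexed fields can not be searched."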

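Additionally, I'm trying to get a pie chart of the Agents. But I can't figure out how to build the visualization, because no matter which box I click, the Agent field is never offered as an option for aggregation. I mention it only because the fields that are indexed do seem to show up.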
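For reference, a pie chart of Agents corresponds roughly to a terms aggregation on that field. The following is a minimal sketch of such a request, assuming the elasticsearch-py client and assuming Agent is indexed as a single not_analyzed term (on an analyzed string, the buckets would be individual tokens rather than whole user-agent strings). The function names agent_terms_query and count_agents are hypothetical.

```python
def agent_terms_query(field='Agent', top_n=10):
    """Build a search body that counts documents per distinct field value."""
    return {
        'size': 0,  # no hits needed, only the aggregation buckets
        'aggs': {
            'agents': {
                'terms': {'field': field, 'size': top_n},
            },
        },
    }


def count_agents(es, index, doc_type):
    """Run the aggregation and return (agent, count) pairs.

    `es` is assumed to be an elasticsearch.Elasticsearch client.
    """
    response = es.search(index=index, doc_type=doc_type,
                         body=agent_terms_query())
    buckets = response['aggregations']['agents']['buckets']
    return [(bucket['key'], bucket['doc_count']) for bucket in buckets]
```

Running count_agents(es, 'visits', 'visit') against the live index is one way to check, outside Kibana, whether the field can be aggregated at all.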

I tried to mimic the mapping in the paintsearch.py example that pulls in GitHub data. Can someone correct how I'm using that mapping?

Thanks

------------ Mapping -------------

{
  "visits": {
    "mappings": {
      "visit": {
        "properties": {
          "Agent": {
            "type": "string"
          },
          "Entrance Time": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "Raw": {
            "properties": {
              "Entrance Time": {
                "type": "double"
              },
              "domain": {
                "type": "string"
              },
              "uniqID": {
                "type": "string"
              }
            }
          },
          "Referrer": {
            "type": "string"
          },
          "Request Time": {
            "type": "string"
          },
          "Srvr IP": {
            "type": "string"
          },
          "Visitor IP": {
            "type": "string"
          },
          "domain": {
            "type": "string"
          },
          "uniqID": {
            "type": "string"
          }
        }
      }
    }
  }
}

------------- Update and new mapping -----------

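So I deleted the index and recreated it. The original index contained some data from before I knew about mapping data to specific field types. Recreating it seems to have fixed the problem of only a few fields being indexed.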

However, parts of my mapping seem to be ignored, specifically the Agent string mapping:

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string', 'index': 'not_analyzed' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,
        'Request Time': time_date_mapping,
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
    },
}

This is the output of http://localhost:9200/visits_test2/_mapping:

{
  "visits_test2": {
    "mappings": {
      "visit": {
        "properties": {
          "Agent": {"type": "string"},
          "Entrance Time": {"type": "date", "format": "dateOptionalTime"},
          "Raw": {
            "properties": {
              "Entrance Time": {"type": "double"},
              "domain": {"type": "string"},
              "uniqID": {"type": "string"}
            }
          },
          "Referrer": {"type": "string"},
          "Request Time": {"type": "date", "format": "dateOptionalTime"},
          "Srvr IP": {"type": "string"},
          "Visitor IP": {"type": "string"},
          "domain": {"type": "string"},
          "uniqID": {"type": "string"}
        }
      }
    }
  }
}
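
Comparing the intended mapping with what ES reports can be done mechanically. Below is a minimal sketch of such a comparison. The helper name mapping_differences is hypothetical, and it only checks the keys present in the intended definition.

```python
def mapping_differences(intended, actual):
    """Return {field: (intended_def, actual_def)} for every field that differs.

    Both arguments are 'properties' dicts, e.g. visit_mapping['properties']
    and the 'properties' section of the _mapping response.
    """
    differences = {}
    for field, wanted in intended.items():
        got = actual.get(field)  # None when the field is missing on the server
        if got is None or any(got.get(k) != v for k, v in wanted.items()):
            differences[field] = (wanted, got)
    return differences


# Values taken from the mapping and the _mapping output shown above.
intended = {
    'Agent': {'type': 'string', 'index': 'not_analyzed'},
    'Referrer': {'type': 'string'},
}
actual = {
    'Agent': {'type': 'string'},
    'Referrer': {'type': 'string'},
}
print(mapping_differences(intended, actual))
```

With the elasticsearch-py client, the live definition could presumably be fetched with es.indices.get_mapping(index='visits_test2', doc_type='visit') and then unwrapped following the index / mappings / doc type / properties nesting visible in the output above.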

Note that I used a brand-new index. I wanted to make sure that nothing carried over from one index to the other.

Note that I'm using the Python library elasticsearch.py and following its examples for the mapping syntax.

--------- Python code for getting the data into ES, as requested in the comments -----------

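Below is a file named mapping.py. I haven't fully commented the code yet, because it only exists to test whether this way of getting data into ES is viable. If it isn't self-explanatory, let me know and I'll add more comments.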

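Note that I programmed in PHP for several years before picking up Python. To get up and running with Python faster, I created a few files of basic string and file manipulation functions and bundled them into a package. They are written in Python and are meant to mimic the behavior of the built-in PHP functions. So when you see a call to php_basic_*, it is one of those functions.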

# Standard Library Imports
import json, copy, datetime, time, enum, os, sys, numpy, math
from datetime import datetime
from enum import Enum, unique
from elasticsearch import Elasticsearch

# My Library
import basicconfig, mybasics
from mybasics.cBaseClass import BaseClass, BaseClassErrors
from mybasics.cHelpers import HandleErrors, LogLvl

# This imports several constants, a couple of functions, and a helper class
from basicconfig.startup_config import *

# Connect to ElasticSearch
es = Elasticsearch([{'host': 'localhost', 'port': '9200'}])

# Create mappings of a visit
time_date_mapping = { 'type': 'date_time' }
str_not_analyzed = { 'type': 'string'} # This originally included 'index': 'not_analyzed' as well

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string', 'index': 'not_analyzed' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,
        'Request Time': time_date_mapping,
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
        'Pages': { 'type': 'string', 'index': 'not_analyzed' },
    },
}


class Visit_to_ElasticSearch(object):
    """
    Load one visit file (JSON) and optionally index a summary of it in ES.
    """

    INDEX = 'visits'
    DOC_TYPE = 'visit'



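    def __init__(self, fname, index=True):
        """
        fname: full path to the visit file.
        index: when True, index the visit as soon as it has been parsed.
        """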

        self._visit = json.loads(php_basic_files.file_get_contents(fname))
        self._pages = self._visit.pop('pages')

        self.uniqID = self._visit['uniqID']
        self.domain = self._visit['domain']
        self.entrance_time = self._convert_time(self._visit['Entrance Time'])

        # Get a list of the page IDs
        self.pages = self._pages.keys()

        # Extra IPs and such from a single page
        page = self._pages[self.pages[0]]
        srvr = page['SERVER']
        req = page['REQUEST']

        self.visitor_ip = srvr['REMOTE_ADDR']
        self.srvr_ip = srvr['SERVER_ADDR']
        self.request_time = self._convert_time(srvr['REQUEST_TIME'])

        self.agent = srvr['HTTP_USER_AGENT']

        # Now go grab data that might not be there...
        self._extract_optional()

        if index is True:
            self.index_with_elasticsearch()


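    def _convert_time(self, ts):
        """
        Convert a Unix timestamp (number or numeric string) to an
        ISO 8601 string in the machine's local time.
        """

        try:
            dt = datetime.fromtimestamp(ts)
        except TypeError:
            # String timestamps such as REQUEST_TIME must be converted first
            dt = datetime.fromtimestamp(float(ts))

        return dt.strftime('%Y-%m-%dT%H:%M:%S')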


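    def _extract_optional(self):
        """
        Set defaults for fields that may be absent from a visit.
        Currently this only sets an empty referrer.
        """

        self.referrer = ''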


    def index_with_elasticsearch(self):
        """
        Build the document for this visit and index it under its unique ID.
        """

        visit = {
            'uniqID': self.uniqID,
            'pages': [],
            'domain': self.domain,
            'Srvr IP': self.srvr_ip,
            'Visitor IP': self.visitor_ip,
            'Agent': self.agent,
            'Referrer': self.referrer,
            'Entrance Time': self.entrance_time,
            'Request Time': self.request_time,
            'Raw': self._visit,
            'Pages': php_basic_str.implode(', ', self.pages),
        }

        es.index(
            index=Visit_to_ElasticSearch.INDEX,
            doc_type=Visit_to_ElasticSearch.DOC_TYPE,
            id=self.uniqID,
            timestamp=int(math.floor(self._visit['Entrance Time'])),
            body=visit
        )   


es.indices.create(
    index=Visit_to_ElasticSearch.INDEX,
    body={
        'settings': {
            'number_of_shards': 5,
            'number_of_replicas': 1,
        }
    },
    # ignore already existing index
    ignore=400
)
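
One thing worth noting in the script above is that visit_mapping is defined but never passed to any call, and the body given to es.indices.create contains only settings. The following is a minimal sketch of one way a mapping can be supplied when the index is created, assuming the elasticsearch-py 1.x client and the ES 1.x layout in which mappings are keyed by document type. The names build_index_body, create_index, and VISIT_MAPPING are hypothetical. The sketch uses 'date' as the type name because that is what the _mapping output earlier in this post reports; treating 'date_time' as a format name rather than a type is an assumption here.

```python
# Hypothetical, trimmed-down mapping for illustration only.
VISIT_MAPPING = {
    'properties': {
        'uniqID': {'type': 'string', 'index': 'not_analyzed'},
        'Agent': {'type': 'string', 'index': 'not_analyzed'},
        'Entrance Time': {'type': 'date'},  # assumption: 'date', not 'date_time'
        'Request Time': {'type': 'date'},
    },
}


def build_index_body(doc_type, mapping, shards=5, replicas=1):
    """Combine index settings and a per-type mapping into one request body."""
    return {
        'settings': {
            'number_of_shards': shards,
            'number_of_replicas': replicas,
        },
        'mappings': {doc_type: mapping},
    }


def create_index(es, index, doc_type, mapping):
    """Create the index with its mapping.

    `es` is assumed to be an elasticsearch.Elasticsearch client. No `ignore`
    argument is passed, so a rejected request raises instead of being hidden.
    """
    return es.indices.create(index=index,
                             body=build_index_body(doc_type, mapping))
```

Leaving out ignore=400 is a deliberate choice in this sketch. With it, a request that ES rejects with a 400 status, whether because the index already exists or because the mapping is malformed, would pass silently, and in the already-exists case the mapping in the body is not applied.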

In case it matters, this is the simple loop I use to dump the data into ES:

for f in all_files:
    try:
        visit = mapping.Visit_to_ElasticSearch(f)
    except IOError:
        pass

where all_files is a list of all the visit files (full paths) in my test data set.
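
As a possible alternative to one es.index call per file, documents can be batched. Below is a minimal sketch of a generator that produces bulk actions, assuming each document is a dict carrying its own uniqID. The name bulk_actions is hypothetical, and the sample document uses placeholder values.

```python
def bulk_actions(docs, index='visits', doc_type='visit'):
    """Yield one bulk action per document, keyed by the document's uniqID."""
    for doc in docs:
        yield {
            '_index': index,
            '_type': doc_type,
            '_id': doc['uniqID'],
            '_source': doc,
        }


# Placeholder document for illustration only.
sample_docs = [{'uniqID': 'bbc398716f4703cfabd761cc8d4101a1',
                'domain': 'example.com'}]
actions = list(bulk_actions(sample_docs))
```

Assuming the helpers module that ships with elasticsearch-py, the generator could then be passed to helpers.bulk(es, bulk_actions(docs)).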

Here is a sample visit file, from a visit by Google Bot:

{u'Entrance Time': 1407551587.7385,
     u'domain': u'############',
     u'pages': {u'6818555600ccd9880bf7acef228c5d47': {u'REQUEST': [],
       u'SERVER': {u'DOCUMENT_ROOT': u'/var/www/####/',
        u'Entrance Time': 1407551587.7385,
        u'GATEWAY_INTERFACE': u'CGI/1.1',
        u'HTTP_ACCEPT': u'*/*',
        u'HTTP_ACCEPT_ENCODING': u'gzip,deflate',
        u'HTTP_CONNECTION': u'Keep-alive',
        u'HTTP_FROM': u'googlebot(at)googlebot.com',
        u'HTTP_HOST': u'############',
        u'HTTP_IF_MODIFIED_SINCE': u'Fri, 13 Jun 2014 20:26:33 GMT',
        u'HTTP_USER_AGENT': u'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        u'PATH': u'/usr/local/bin:/usr/bin:/bin',
        u'PHP_SELF': u'/index.php',
        u'QUERY_STRING': u'',
        u'REDIRECT_SCRIPT_URI': u'http://############/',
        u'REDIRECT_SCRIPT_URL': u'############',
        u'REDIRECT_STATUS': u'200',
        u'REDIRECT_URL': u'############',
        u'REMOTE_ADDR': u'############',
        u'REMOTE_PORT': u'46271',
        u'REQUEST_METHOD': u'GET',
        u'REQUEST_TIME': u'1407551587',
        u'REQUEST_URI': u'############',
        u'SCRIPT_FILENAME': u'/var/www/PIAN/index.php',
        u'SCRIPT_NAME': u'/index.php',
        u'SCRIPT_URI': u'http://############/',
        u'SCRIPT_URL': u'/############/',
        u'SERVER_ADDR': u'############',
        u'SERVER_ADMIN': u'admin@############',
        u'SERVER_NAME': u'############',
        u'SERVER_PORT': u'80',
        u'SERVER_PROTOCOL': u'HTTP/1.1',
        u'SERVER_SIGNATURE': u'<address>Apache/2.2.22 (Ubuntu) Server at ############ Port 80</address>\n',
        u'SERVER_SOFTWARE': u'Apache/2.2.22 (Ubuntu)',
        u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'},
       u'SESSION': {u'Entrance Time': 1407551587.7385,
        u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}}},
     u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}
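
To make the structure concrete, here is a minimal sketch that reduces a visit shaped like the sample above to the flat fields listed at the start of this post. It is not the class shown earlier: the names to_iso and summarize_visit are hypothetical, timestamps are rendered in UTC rather than local time, the referrer is read from HTTP_REFERER when present (an assumption, since the sample has no such key), and the IP addresses and domain in the example are placeholders because the real ones are masked.

```python
import time


def to_iso(ts):
    """Render a Unix timestamp (number or numeric string) as ISO 8601 in UTC."""
    return time.strftime('%Y-%m-%dT%H:%M:%S', time.gmtime(float(ts)))


def summarize_visit(visit):
    """Pick the fields of interest out of one parsed visit file."""
    pages = visit['pages']
    first_page = next(iter(pages.values()))  # any page carries the IPs and agent
    server = first_page['SERVER']
    return {
        'uniqID': visit['uniqID'],
        'domain': visit['domain'],
        'Visitor IP': server['REMOTE_ADDR'],
        'Srvr IP': server['SERVER_ADDR'],
        'Agent': server['HTTP_USER_AGENT'],
        'Referrer': server.get('HTTP_REFERER', ''),  # assumption: may be absent
        'Entrance Time': to_iso(visit['Entrance Time']),
        'Request Time': to_iso(server['REQUEST_TIME']),
        'Pages': ', '.join(sorted(pages)),
    }


# Trimmed copy of the sample above, with placeholder domain and IPs.
sample_visit = {
    'Entrance Time': 1407551587.7385,
    'domain': 'example.com',
    'uniqID': 'bbc398716f4703cfabd761cc8d4101a1',
    'pages': {
        '6818555600ccd9880bf7acef228c5d47': {
            'REQUEST': [],
            'SERVER': {
                'HTTP_USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; '
                                   '+http://www.google.com/bot.html)',
                'REMOTE_ADDR': '203.0.113.5',
                'SERVER_ADDR': '198.51.100.7',
                'REQUEST_TIME': '1407551587',
            },
        },
    },
}
summary = summarize_visit(sample_visit)
```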

0 answers:

No answers yet