Algorithm for grouping IP ranges and detecting outliers

Time: 2018-03-18 15:03:03

Tags: algorithm ip bigdata

I have a large database like this (operator name, IP):

+-------+----------------+
|Name   |IP              |
+-------+----------------+
|A      |41.74.63.255    |
+-------+----------------+
|B      |168.167.255.255 |
+-------+----------------+
|...    |...             |
+-------+----------------+

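I want to automatically build a list of "valid ranges" for each operator, based on how frequently addresses appear in the database, i.e.: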

Operator "A" -> [(range A), (range B), (range C)...] 
Operator "B" -> [(range A)...]

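In other words, some kind of clustering, followed by a check of whether a new IP from operator A belongs to a valid cluster or is an outlier.

What would be a good place to start?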

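1 Answer:

Answer 0: (score: 1)

There are two possible approaches:

  1. Some kind of merge sort.
  2. Some kind of mapping (under certain assumptions).

Merge sort

Or, more precisely, sort and then merge. This is a straightforward approach that is easy to understand and implement, but it can be slow and inefficient.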

    SRC_LIST = (operator, ip) // source list of (operator, IP)
    SORTED_IPS[] = (ip) // map of per-operator lists of sorted IPs
    RANGES[] = (startIP, endIP) // map of per-operator lists of ranges
    
    // sort IPs to per-operator lists of IPs
    for E in SRC_LIST
        SortInsertIP(SORTED_IPS[E.operator], E.ip)
    
    // merge continuous IPs into ranges
    for OP in SORTED_IPS
        for IP in SORTED_IPS[OP]
            MergeIP(RANGES[OP], IP)
    
    // sort merged lists based on their appearance frequency
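
The sort-then-merge pseudocode above could be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function names `build_ranges` and `in_valid_range`, the `max_gap` parameter, and the per-range hit count are assumptions added for the example, and addresses are assumed to be IPv4 strings.

```python
import ipaddress
from bisect import bisect_right
from collections import defaultdict


def build_ranges(src_list, max_gap=1):
    """Return {operator: [(start, end, count), ...]} sorted by start.

    start and end are inclusive IPv4 addresses as integers; count is the
    number of source rows that fell into the range. max_gap (hypothetical)
    is the largest distance between neighbours that still gets merged.
    """
    # Group the addresses per operator, converted to integers.
    per_operator = defaultdict(list)
    for operator, ip in src_list:
        per_operator[operator].append(int(ipaddress.IPv4Address(ip)))

    ranges = {}
    for operator, ips in per_operator.items():
        ips.sort()
        merged = []
        for ip in ips:
            if merged and ip - merged[-1][1] <= max_gap:
                # Duplicate or close neighbour: extend the last range.
                start, _, count = merged[-1]
                merged[-1] = (start, ip, count + 1)
            else:
                merged.append((ip, ip, 1))
        ranges[operator] = merged
    return ranges


def in_valid_range(ranges, operator, ip):
    """True if ip lies inside one of the operator's merged ranges."""
    value = int(ipaddress.IPv4Address(ip))
    op_ranges = ranges.get(operator, [])
    starts = [start for start, _, _ in op_ranges]
    # Index of the last range whose start is <= value, or -1 if none.
    pos = bisect_right(starts, value) - 1
    return pos >= 0 and value <= op_ranges[pos][1]
```

To order an operator's ranges by appearance frequency, as the last pseudocode comment suggests, something like `sorted(ranges["A"], key=lambda r: r[2], reverse=True)` would do. The lookup keeps the start-sorted order so that a binary search can find the candidate range.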
    

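Some kind of mapping

This can be very fast and efficient, but it has a few prerequisites:

  1. Only IPv4 addresses, i.e. no IPv6.
  2. The IP addresses in the database must be /24 prefixes, i.e. always end with .255 or .0.
  3. The number of operators is limited, ideally to at most 255 so that an operator index fits in one byte.

If these hold, we can use the first 3 bytes of the IPv4 address as an index into a 2^24-entry table of operators. Then we only need to merge consecutive indexes.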

      SRC_LIST = (operator, ip) // source list of (operator, IP)
      OPERATORS[] = (idx) // map each operator to an index starting from 1
      TBL24[2^24] = (op_idx) // table of 2^24 operators, where 0 is unused entry
      RANGES[] = (startIP, endIP) // map of per-operator lists of ranges
      
      // map DB to TBL24 table
      for E in SRC_LIST
          op_idx = GetOperatorIndex(OPERATORS[], E.operator)
          ip_idx = uint32(E.ip) >> 8 // convert 32-bit IPv4 to 24-bit index
          TBL24[ip_idx] = op_idx
      
      // find runs of consecutive entries that belong to the same operator
      startIdx = 0
      while startIdx < 2^24
          endIdx = startIdx + 1
          while endIdx < 2^24 and TBL24[startIdx] == TBL24[endIdx]
              endIdx = endIdx + 1
      
          // append found range to the per-operator list
          // endIdx is one past the last matching entry, so the range ends at (endIdx << 8) - 1
          if TBL24[startIdx] != 0 // i.e. non-empty
              AppendList(RANGES[TBL24[startIdx]], (startIdx << 8, (endIdx << 8) - 1))
      
          startIdx = endIdx
      
      // sort merged lists based on their appearance frequency
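
As a rough Python sketch of the table-based approach above, assuming the stated prerequisites hold (IPv4 only, /24 granularity, at most 255 operators): the names `build_tbl24`, `extract_ranges` and `is_valid` are hypothetical, and `itertools.groupby` stands in for the explicit run-scanning loop of the pseudocode.

```python
import ipaddress
from collections import defaultdict
from itertools import groupby

TABLE_SIZE = 1 << 24  # one slot per /24 prefix


def build_tbl24(src_list):
    """Map every (operator, ip) row into a 2^24-entry table of operator indexes."""
    operators = {}                 # operator name -> index, starting from 1
    tbl24 = bytearray(TABLE_SIZE)  # 0 marks an unused entry
    for operator, ip in src_list:
        op_idx = operators.setdefault(operator, len(operators) + 1)
        if op_idx > 255:
            raise ValueError("more than 255 operators do not fit in one byte")
        # Drop the last byte: 32-bit IPv4 address -> 24-bit table index.
        tbl24[int(ipaddress.IPv4Address(ip)) >> 8] = op_idx
    return operators, tbl24


def extract_ranges(operators, tbl24):
    """Return {operator: [(first_ip, last_ip), ...]} with inclusive integer bounds."""
    names = {idx: name for name, idx in operators.items()}
    ranges = defaultdict(list)
    start = 0
    # groupby yields one group per run of equal consecutive table entries.
    for op_idx, run in groupby(tbl24):
        length = sum(1 for _ in run)
        if op_idx != 0:  # skip runs of unused entries
            first_ip = start << 8
            last_ip = ((start + length) << 8) - 1
            ranges[names[op_idx]].append((first_ip, last_ip))
        start += length
    return dict(ranges)


def is_valid(operators, tbl24, operator, ip):
    """True if the /24 containing ip was seen for this operator."""
    op_idx = operators.get(operator)
    if op_idx is None:
        return False
    return tbl24[int(ipaddress.IPv4Address(ip)) >> 8] == op_idx
```

Checking a new address is then a single table lookup, which is the main attraction of this layout. Note that a later row for the same /24 overwrites an earlier one, so a prefix shared by two operators ends up attributed to whichever appeared last in the source list.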
      

There is also a way to handle not only /24 prefixes but more specific prefixes as well. See the paper "Routing Lookups in Hardware at Memory Access Speeds" or the software implementation of the DIR-24-8 algorithm in DPDK.