Question

I am trying to build a service that handles geolocation of IP addresses. The open IP address database is a CSV file in the following format: starting_ip, ending_ip, region

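So I am thinking of converting the IPs to integers and checking whether a given integer falls between the start and end of a range... but at this point I am not sure how to do that comparison efficiently, given that there are about 500K entries.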
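As a minimal sketch of the conversion described above, assuming IPv4 addresses in dotted-quad form, Python's standard `ipaddress` module can turn an address into an integer so that the range check becomes a plain comparison. The helper name `ip_to_int` and the sample addresses are illustrative only.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


start = ip_to_int("192.168.1.1")
end = ip_to_int("192.168.1.98")
candidate = ip_to_int("192.168.1.50")

# With integers, "is the address inside the range?" is a chained comparison.
print(start <= candidate <= end)
```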

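At first I tried to load everything into memory using a dictionary like this:

    {(ip_start, ip_end): 'region', ....}

But at this point I don't see how to find the right key in this dictionary given an IP address.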

Answer 1

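Assuming the ranges do not overlap, you can sort them once by ip_start and then use binary search to find the candidate range. Once you have the candidate range, you only need to check whether the IP address falls between its ip_start and ip_end.

You can use the built-in bisect module to perform the binary search.

This gives an O(log n) cost per query.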
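The approach can be sketched roughly as follows, assuming non-overlapping IPv4 ranges. The function names `ip_to_int`, `build_index` and `find_region` and the sample rows are made up for illustration; only `bisect` and `ipaddress` come from the standard library.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to an integer."""
    return int(ipaddress.IPv4Address(ip))


def build_index(rows):
    """Sort (starting_ip, ending_ip, region) rows once by start address."""
    ranges = sorted((ip_to_int(s), ip_to_int(e), region) for s, e, region in rows)
    starts = [r[0] for r in ranges]  # parallel list that bisect can search
    return starts, ranges


def find_region(starts, ranges, ip):
    """Return the region whose range contains ip, or None if there is none."""
    value = ip_to_int(ip)
    # Index of the last range whose start is <= value.
    i = bisect.bisect_right(starts, value) - 1
    if i < 0:
        return None  # value is below every start address
    start, end, region = ranges[i]
    return region if start <= value <= end else None


rows = [
    ("192.168.43.100", "192.168.43.130", "foo"),
    ("192.168.1.1", "192.168.1.98", "bar"),
    ("10.10.145.10", "10.10.145.100", "foob"),
]
starts, ranges = build_index(rows)
print(find_region(starts, ranges, "192.168.43.102"))  # inside the "foo" range
print(find_region(starts, ranges, "127.0.0.1"))       # not covered by any range
```

Sorting happens once up front; each lookup is then a single binary search plus one comparison against the candidate's end address.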

Answer 2

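I would suggest keeping the data sorted in whatever usable format you like, but using a sortedcontainers SortedDict will let you do the lookup in O(log n) time once you have a collection sorted by start ip: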

    import csv
    from sortedcontainers import sorteddict

    with open("ips.csv") as f:
        ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99", "127.0.0.1"]
        reader = csv.reader(f)
        # Use start ip as the key, creating tuple or using netaddr to turn into an int
        sorted_dict = sorteddict.SortedDict(
            (tuple(map(int, sip.split("."))), (eip, rnge)) for sip, eip, rnge in reader)
        for ip in ips:
            # do the same for the ip you want to search for
            ip = tuple(map(int, ip.split(".")))
            # bisect to see where the ip would land
            ind = sorted_dict.bisect(ip) - 1
            # .iloc indexes the sorted keys by position; sortedcontainers 2.x
            # deprecates it in favour of sorted_dict.keys()[ind]
            start_ip = sorted_dict.iloc[ind]
            end_ip = tuple(map(int, sorted_dict[sorted_dict.iloc[ind]][0].split(".")))
            print(start_ip, ip, end_ip)
            print(start_ip <= ip <= end_ip)

If we run the code on a test file:

    In [5]: !cat ips.csv
    192.168.43.100,192.168.43.130,foo
    192.168.27.1,192.168.27.12,foobar
    192.168.1.1,192.168.1.98,bar
    192.168.43.131,192.168.43.140,bar
    10.10.131.10,10.10.131.15,barf
    10.10.145.10,10.10.145.100,foob

    In [6]: import csv

    In [7]: from sortedcontainers import sorteddict

    In [8]: with open("ips.csv") as f:
       ...:     ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99","127.0.0.1"]
       ...:     reader = csv.reader(f)
       ...:     sorted_dict = sorteddict.SortedDict((tuple(map(int, sip.split("."))),(eip, rnge)) for sip, eip, rnge in reader)
       ...:     for ip in ips:
       ...:         ip = tuple(map(int, ip.split(".")))
       ...:         ind = sorted_dict.bisect(ip) - 1
       ...:         start_ip = sorted_dict.iloc[ind]
       ...:         end_ip = tuple(map(int, sorted_dict[sorted_dict.iloc[ind]][0].split(".")))
       ...:         print(start_ip,ip, end_ip)
       ...:         print(start_ip <= ip <= end_ip)
       ...:
    (192, 168, 43, 100) (192, 168, 43, 102) (192, 168, 43, 130)
    True
    (10, 10, 145, 10) (10, 10, 145, 100) (10, 10, 145, 100)
    True
    (192, 168, 1, 1) (192, 168, 1, 1) (192, 168, 1, 98)
    True
    (192, 168, 27, 1) (192, 168, 43, 99) (192, 168, 27, 12)
    False
    (10, 10, 145, 10) (127, 0, 0, 1) (10, 10, 145, 100)
    False

You could also modify bisect_right to consider only the first element of each entry and use a regular Python list:

    import csv

    def bisect_right(a, x, lo=0, hi=None):
        # Same as bisect.bisect_right, but compares x only against
        # the first element (the start ip) of each entry in a.
        if lo < 0:
            raise ValueError('lo must be non-negative')
        if hi is None:
            hi = len(a)
        while lo < hi:
            mid = (lo + hi) // 2
            if x < a[mid][0]:
                hi = mid
            else:
                lo = mid + 1
        return lo

    with open("ips.csv") as f:
        ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99", "127.0.0.1"]
        reader = csv.reader(f)
        # Sort (start_ip_tuple, end_ip_string, region) entries by start ip
        sorted_data = sorted(((tuple(map(int, sip.split("."))), eip, rnge) for sip, eip, rnge in reader))
        for ip in ips:
            ip = tuple(map(int, ip.split(".")))
            # Entry with the largest start ip that is <= ip
            ind = bisect_right(sorted_data, ip) - 1
            start_ip, end_ip, _ = sorted_data[ind]
            end_ip = tuple(map(int, end_ip.split(".")))
            print(start_ip, ip, end_ip)
            print(start_ip <= ip <= end_ip)

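The result will be the same. I think using SortedDict will almost certainly be faster, because its bisect is done at the C level, whereas the bisect_right above runs in pure Python.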
