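I am trying to build a service that handles geolocation of IP addresses. The open IP address database is a CSV file in the following format: starting_ip,ending_ip,region
So I'm thinking of converting the IP to an integer and checking whether that integer falls between the start and end of a range... but at this point I'm not sure how to do that comparison efficiently, given the size of about 500K entries.
At first I tried loading everything into memory using a dictionary like this:
{(ip_start, ip_end): 'region', ....}
But at this point I don't see how to find the right key in this dictionary from an IP address.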
Answer 0 (score: 2)
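Assuming the ranges do not overlap, you can sort them once by ip_start and then use binary search to find the candidate range. Once you have the candidate range, you only need to check whether the IP address lies between ip_start and ip_end.
You can use the built-in bisect module to perform the binary search.
This gives an O(log n) cost per lookup.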
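A minimal sketch of that approach, assuming IPv4 addresses and non-overlapping ranges. The sample rows and the helper names to_int and lookup are made up for illustration; only bisect and ipaddress come from the standard library.

```python
import bisect
import ipaddress

# Hypothetical sample rows in the question's format: (starting_ip, ending_ip, region).
rows = [
    ("192.168.43.100", "192.168.43.130", "foo"),
    ("10.10.145.10", "10.10.145.100", "foob"),
    ("192.168.1.1", "192.168.1.98", "bar"),
]


def to_int(ip):
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))


# Sort once by ip_start; keep the start values in their own list for bisect.
ranges = sorted((to_int(s), to_int(e), region) for s, e, region in rows)
starts = [r[0] for r in ranges]


def lookup(ip):
    """Return the region containing ip, or None if no range matches."""
    value = to_int(ip)
    # Index of the last range whose start is <= value (the candidate range).
    i = bisect.bisect_right(starts, value) - 1
    if i < 0:
        return None  # ip is below every range start
    start, end, region = ranges[i]
    return region if value <= end else None


print(lookup("192.168.43.102"))
print(lookup("127.0.0.1"))
```

The sort happens once at load time; each lookup afterwards is a single binary search plus one comparison against ip_end.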
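Answer 1 (score: 0)
I suggest you sort the data in whatever format you like, but using a SortedDict from sortedcontainers will let you look up in log n time once you have a sorted collection, sorted by start ip: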
import csv
from sortedcontainers import sorteddict

with open("ips.csv") as f:
    ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99", "127.0.0.1"]
    reader = csv.reader(f)
    # Use start ip as the key, creating a tuple or using netaddr to turn it into an int
    sorted_dict = sorteddict.SortedDict(
        (tuple(map(int, sip.split("."))), (eip, rnge)) for sip, eip, rnge in reader)
    for ip in ips:
        # do the same for the ip you want to search for
        ip = tuple(map(int, ip.split(".")))
        # bisect to see where the ip would land; the candidate range is the one before it
        ind = sorted_dict.bisect(ip) - 1
        # iloc is deprecated in sortedcontainers 2.x; sorted_dict.keys()[ind] is the replacement
        start_ip = sorted_dict.iloc[ind]
        end_ip = tuple(map(int, sorted_dict[start_ip][0].split(".")))
        print(start_ip, ip, end_ip)
        # True only if the ip falls inside the candidate range
        print(start_ip <= ip <= end_ip)
If we run the code on a test file:
In [5]: !cat ips.csv
192.168.43.100,192.168.43.130,foo
192.168.27.1,192.168.27.12,foobar
192.168.1.1,192.168.1.98,bar
192.168.43.131,192.168.43.140,bar
10.10.131.10,10.10.131.15,barf
10.10.145.10,10.10.145.100,foob

In [6]: import csv

In [7]: from sortedcontainers import sorteddict

In [8]: with open("ips.csv") as f:
   ...:     ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99","127.0.0.1"]
   ...:     reader = csv.reader(f)
   ...:     sorted_dict = sorteddict.SortedDict((tuple(map(int, sip.split("."))),(eip, rnge)) for sip, eip, rnge in reader)
   ...:     for ip in ips:
   ...:         ip = tuple(map(int, ip.split(".")))
   ...:         ind = sorted_dict.bisect(ip) - 1
   ...:         start_ip = sorted_dict.iloc[ind]
   ...:         end_ip = tuple(map(int, sorted_dict[sorted_dict.iloc[ind]][0].split(".")))
   ...:         print(start_ip,ip, end_ip)
   ...:         print(start_ip <= ip <= end_ip)
   ...:
(192, 168, 43, 100) (192, 168, 43, 102) (192, 168, 43, 130)
True
(10, 10, 145, 10) (10, 10, 145, 100) (10, 10, 145, 100)
True
(192, 168, 1, 1) (192, 168, 1, 1) (192, 168, 1, 98)
True
(192, 168, 27, 1) (192, 168, 43, 99) (192, 168, 27, 12)
False
(10, 10, 145, 10) (127, 0, 0, 1) (10, 10, 145, 100)
False
You can also modify bisect_right to consider only the first element of each entry and use a regular Python list:
import csv

def bisect_right(a, x, lo=0, hi=None):
    # Same as bisect.bisect_right, but compares x against a[mid][0] only
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid][0]:
            hi = mid
        else:
            lo = mid + 1
    return lo

with open("ips.csv") as f:
    ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99", "127.0.0.1"]
    reader = csv.reader(f)
    # Each entry is (start ip as a tuple of ints, end ip string, region), sorted by start ip
    sorted_data = sorted((tuple(map(int, sip.split("."))), eip, rnge) for sip, eip, rnge in reader)
    for ip in ips:
        ip = tuple(map(int, ip.split(".")))
        # the candidate range is the last one whose start ip is <= ip
        ind = bisect_right(sorted_data, ip) - 1
        start_ip, end_ip, _ = sorted_data[ind]
        end_ip = tuple(map(int, end_ip.split(".")))
        print(start_ip, ip, end_ip)
        print(start_ip <= ip <= end_ip)
The results will be the same. I think using SortedDict will almost certainly be faster, since the bisect is done at the C level.
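If the goal is to keep a plain list and still have the standard bisect module do the search, one option is to hold the start keys in a parallel list. This is only a sketch, not part of the answer as posted: the names parse, starts and find_region are hypothetical, and the CSV text is the test file from above inlined through io.StringIO.

```python
import bisect
import csv
import io

# The test file's contents from above, inlined so the sketch is self-contained.
CSV_TEXT = """\
192.168.43.100,192.168.43.130,foo
192.168.27.1,192.168.27.12,foobar
192.168.1.1,192.168.1.98,bar
192.168.43.131,192.168.43.140,bar
10.10.131.10,10.10.131.15,barf
10.10.145.10,10.10.145.100,foob
"""


def parse(ip):
    """Turn '192.168.1.1' into the comparable tuple (192, 168, 1, 1)."""
    return tuple(map(int, ip.split(".")))


# Rows sorted by start ip, plus a parallel list holding only the start keys.
rows = sorted((parse(sip), parse(eip), region)
              for sip, eip, region in csv.reader(io.StringIO(CSV_TEXT)))
starts = [row[0] for row in rows]


def find_region(ip):
    """Return the region whose range contains ip, or None."""
    key = parse(ip)
    # Standard-library bisect on the keys list; no hand-written search loop.
    ind = bisect.bisect_right(starts, key) - 1
    if ind < 0:
        return None  # ip sorts before every start ip
    start_ip, end_ip, region = rows[ind]
    return region if start_ip <= key <= end_ip else None


for ip in ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99", "127.0.0.1"]:
    print(ip, find_region(ip))
```

The explicit check for ind < 0 returns None for addresses below the first range instead of wrapping around to the last entry. The trade-off is the extra memory for the second list.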