Question

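I have a very large dictionary with 200 million keys. Each key is a tuple whose elements are integers. I want to find the key for which a "query integer" lies between the two integers of the tuple.

Currently, I iterate over all dictionary keys and compare the integer with the elements of each tuple to check whether it falls within that range. It works, but each query takes about 1-2 minutes, and I need to run about 1 million such queries. An example of the dictionary and the code I wrote are below:

Example dictionary: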

[{ (3547237440, 3547237503) : {'state': 'seoul teukbyeolsi', 'country': 'korea (south)', 'country_code': 'kr', 'city': 'seoul'} },
{ (403044176, 403044235) : {'state': 'california', 'country': 'united states', 'country_code': 'us', 'city': 'pleasanton'} },
{ (3423161600, 3423161615) : {'state': 'kansas', 'country': 'united states', 'country_code': 'us', 'city': 'lenexa'} },
{ (3640467200, 3640467455) : {'state': 'california', 'country': 'united states', 'country_code': 'us', 'city': 'san jose'} },
{ (853650485, 853650485) : {'state': 'colorado', 'country': 'united states', 'country_code': 'us', 'city': 'arvada'} },
{ (2054872064, 2054872319) : {'state': 'tainan', 'country': 'taiwan', 'country_code': 'tw', 'city': 'tainan'} },
{ (1760399104, 1760399193) : {'state': 'texas', 'country': 'united states', 'country_code': 'us', 'city': 'dallas'} },
{ (2904302140, 2904302143) : {'state': 'iowa', 'country': 'united states', 'country_code': 'us', 'city': 'hampton'} },
{ (816078080, 816078335) : {'state': 'district of columbia', 'country': 'united states', 'country_code': 'us', 'city': 'washington'} },
{ (2061589204, 2061589207) : {'state': 'zhejiang', 'country': 'china', 'country_code': 'cn', 'city': 'hangzhou'} }]

The code I wrote:

import ipaddress

ipint = int(ipaddress.IPv4Address(ip))
# Linear scan: compare the query integer against every (start, end) key
for k in ip_dict.keys():
    if k[0] <= ipint <= k[1]:
        print(ip_dict[k]['country'], ip_dict[k]['country_code'], ip_dict[k]['state'])

where ip is just an IP address, e.g. "192.168.0.1".

If anyone can offer tips on a more efficient way to perform this task, it would be greatly appreciated.

Thanks

Answer 1

I suggest you use a different structure with better query complexity, such as a tree.

Maybe you could try this library, which I just found: https://pypi.org/project/rangetree/

As they say, it is optimized for lookups rather than insertions, so if you need to insert once and look up many times, it should be fine.

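Another solution is to use a list instead of a dictionary, sort it, and build an index on it. When a query comes in, do a binary search on this index (this may be less optimal if the ranges are irregular, so I prefer the first solution).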
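The sorted-list idea can be sketched with the standard-library bisect module. This is only a minimal illustration, not a tuned implementation: it assumes the ranges do not overlap (which is typical for IP geolocation tables, but is an assumption here), and the names ranges, starts and lookup are hypothetical. The sample data is a small subset of the question's dictionary.

```python
import bisect
import ipaddress

# Hypothetical sample: three of the ranges from the question, keyed by (start, end).
ip_dict = {
    (3547237440, 3547237503): {"country": "korea (south)", "country_code": "kr",
                               "state": "seoul teukbyeolsi"},
    (403044176, 403044235): {"country": "united states", "country_code": "us",
                             "state": "california"},
    (853650485, 853650485): {"country": "united states", "country_code": "us",
                             "state": "colorado"},
}

# One-time preprocessing: sort the keys by start and keep the starts in a
# parallel list so that bisect can search them directly.
ranges = sorted(ip_dict)
starts = [start for start, _ in ranges]


def lookup(ip):
    """Return the record whose range contains ip, or None.

    Assumes the ranges do not overlap, so at most one range can match.
    """
    ipint = int(ipaddress.IPv4Address(ip))
    # Position of the last range whose start is <= ipint.
    i = bisect.bisect_right(starts, ipint) - 1
    if i >= 0 and ipint <= ranges[i][1]:
        return ip_dict[ranges[i]]
    return None
```

With this layout each query costs one binary search over the sorted starts instead of a scan over every key. For example, lookup("24.5.247.100") falls inside (403044176, 403044235) and returns the California record, while lookup("192.168.0.1") returns None for this sample.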

Answer 2

Create an index for each of the 2 integers: a sorted list like this:

[(left_int, [list_of_row_ids_that_have_this_left_int]),
 (another_greater_left_int, [...])]

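Then you can find, in O(log n), all rows whose left int is less than or equal to the searched int. A binary search will do here.

Do the same for the right int, this time looking for rows whose right int is greater than or equal to the searched int.

Keep the rest of the data in a list of tuples.

More details:

data = [( (3547237440, 3547237503), {'state': 'seoul'} ), ...]
left_idx = [(3547237440, [0,43]), (9547237440, [3])]
# 0, 43, 3 are indices in the data list
# search
max_left_idx = binary_search(left_idx, 3444444)
# now all rows referred to by left_idx[0] ... left_idx[max_left_idx] have left int <= the query
min_right_idx = ...
# all rows referred to by right_idx[min_right_idx] ... right_idx[-1] have right int >= the query
# intersect the two sets of row ids: those rows pass the range check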
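
As a rough, runnable sketch of this two-index approach (the helper names build_index and query are hypothetical, and the sample rows are taken from the question's data), the indices can be built once and searched with bisect. A row matches when its left int is <= the query and its right int is >= the query. Note that only locating the cut points is logarithmic; collecting and intersecting the row-id sets can still touch many rows.

```python
import bisect
from collections import defaultdict

# Hypothetical sample rows: ((left, right), payload).
data = [
    ((3547237440, 3547237503), {"state": "seoul teukbyeolsi"}),
    ((403044176, 403044235), {"state": "california"}),
    ((853650485, 853650485), {"state": "colorado"}),
]


def build_index(rows, pos):
    """Build a sorted list of (bound, [row ids]) for tuple element pos.

    Also returns the bare bounds as a parallel list for use with bisect.
    """
    groups = defaultdict(list)
    for row_id, (key, _) in enumerate(rows):
        groups[key[pos]].append(row_id)
    index = sorted(groups.items())
    return index, [bound for bound, _ in index]


left_idx, left_bounds = build_index(data, 0)
right_idx, right_bounds = build_index(data, 1)


def query(q):
    """Return the payloads of all rows with left <= q <= right."""
    # Entries before this position in left_idx have left <= q.
    hi = bisect.bisect_right(left_bounds, q)
    left_ok = {rid for _, ids in left_idx[:hi] for rid in ids}
    # Entries from this position on in right_idx have right >= q.
    lo = bisect.bisect_left(right_bounds, q)
    right_ok = {rid for _, ids in right_idx[lo:] for rid in ids}
    # Rows present in both sets satisfy the full range check.
    return [data[rid][1] for rid in sorted(left_ok & right_ok)]
```

Unlike a single sorted list, this version does not assume the ranges are disjoint, so query can return more than one row when ranges overlap.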
