I have a pandas DataFrame of users and their IP addresses:
users_df = pd.DataFrame({'id': [1,2,3],
                         'ip': ['96.255.18.236','105.49.228.135','104.236.210.234']})
   id               ip
0   1    96.255.18.236
1   2   105.49.228.135
2   3  104.236.210.234
and a separate DataFrame containing network ranges and the corresponding geoname IDs:
geonames_df = pd.DataFrame({'network': ['96.255.18.0/24','105.49.224.0/19','104.236.128.0/17'],
                            'geoname': ['4360369.0','192950.0','5391959.0']})
     geoname           network
0  4360369.0    96.255.18.0/24
1   192950.0   105.49.224.0/19
2  5391959.0  104.236.128.0/17
For each user, I need to check their ip against all networks, pull out the corresponding geoname, and add it to users_df. I would like this as the output:
   id               ip   geonames
0   1    96.255.18.236  4360369.0
1   2   105.49.228.135   192950.0
2   3  104.236.210.234  5391959.0
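This is easy in the example because the rows are already in matching order and there are only 3 of them. In reality, users_df has 4000 rows and geonames_df has more than 3 million.
I am currently using this: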
import ipaddress

# Parse every CIDR string once up front.
networks = []
for n in geonames_df['network']:
    networks.append(ipaddress.ip_network(n))

# For each user, scan all networks until one contains the IP.
geonames = []
for idx, row in users_df.iterrows():
    ip_address = ipaddress.IPv4Address(row['ip'])
    for block in networks:
        if ip_address in block:
            geonames.append(str(geonames_df.loc[geonames_df['network'] == str(block), 'geoname'].item()))
            break

users_df['geonames'] = geonames
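This is very slow because of the nested loops over the DataFrame/list. Is there a faster way that takes advantage of numpy / pandas? Or at least some approach that is faster than the one above?
There is a similar question (How can I check if an ip is in a network in python 2.x?), but 1) it does not involve pandas / numpy, 2) I want to check multiple IPs against multiple networks, and 3) the highest-voted answer cannot avoid nested loops, which is the cause of my poor performance.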
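For reference, the `ip_address in block` test boils down to an integer range comparison, since a CIDR block is a contiguous range of 32-bit integers. A minimal sketch using only the standard-library `ipaddress` module (the helper name `network_bounds` is made up for illustration):

```python
import ipaddress

def network_bounds(cidr):
    """Return the (first, last) addresses of a CIDR block as integers."""
    net = ipaddress.ip_network(cidr)
    return int(net.network_address), int(net.broadcast_address)

first, last = network_bounds('105.49.224.0/19')
ip = int(ipaddress.IPv4Address('105.49.228.135'))

# Membership is a plain range check once everything is an integer.
inside = first <= ip <= last
```

Seen this way, the problem is a range lookup rather than a string comparison, which is what makes sorted or hashed lookups possible.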
Answer 0: (score: 0)
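I don't think the nested loop can be avoided, but I have combined the previously mentioned solution with pandas. You can check whether it is faster.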
import socket, struct

def makeMask(n):
    "return a 32-bit mask with the top n bits set"
    return (0xFFFFFFFF << (32 - n)) & 0xFFFFFFFF

def dottedQuadToNum(ip):
    "convert decimal dotted quad string to an integer (network byte order)"
    return struct.unpack('!I', socket.inet_aton(ip))[0]

def networkMask(network):
    "convert a CIDR string to its masked network address as an integer"
    address, prefix = network.split('/')
    return dottedQuadToNum(address) & makeMask(int(prefix))

def whichNetwork(ip):
    "return the geoname of the first network to which the ip belongs"
    numIp = dottedQuadToNum(ip)
    for index, aRow in geonames_df.iterrows():
        if (numIp & aRow["Mask"]) == aRow["Net"]:
            return aRow["geoname"]
    return "Not Found"

# Precompute the integer mask and network address for every row.
geonames_df["Mask"] = geonames_df["network"].map(lambda n: makeMask(int(n.split('/')[1])))
geonames_df["Net"] = geonames_df["network"].map(networkMask)
users_df["geonames"] = users_df["ip"].map(whichNetwork)
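If the networks do not overlap (an assumption that would need to be checked against the real data), the inner loop can be replaced by a binary search. The following is a sketch of that idea, not a tested solution for the full 3-million-row table: each network becomes an integer `[start, end]` range, the ranges are sorted by start, and `np.searchsorted` finds the candidate range for every IP at once.

```python
import ipaddress
import numpy as np
import pandas as pd

users_df = pd.DataFrame({'id': [1, 2, 3],
                         'ip': ['96.255.18.236', '105.49.228.135', '104.236.210.234']})
geonames_df = pd.DataFrame({'network': ['96.255.18.0/24', '105.49.224.0/19', '104.236.128.0/17'],
                            'geoname': ['4360369.0', '192950.0', '5391959.0']})

# Convert each network to its integer [start, end] range in a single pass.
nets = [ipaddress.ip_network(n) for n in geonames_df['network']]
starts = np.array([int(n.network_address) for n in nets], dtype=np.int64)
ends = np.array([int(n.broadcast_address) for n in nets], dtype=np.int64)

# Sort the ranges by start address so binary search can be used.
order = np.argsort(starts)
starts, ends = starts[order], ends[order]
geonames = geonames_df['geoname'].to_numpy(dtype=object)[order]

# For each IP, locate the last range whose start is <= the IP.
ips = np.array([int(ipaddress.IPv4Address(ip)) for ip in users_df['ip']], dtype=np.int64)
pos = np.searchsorted(starts, ips, side='right') - 1

# The candidate only matches if the IP is also at or below the range's end.
safe_pos = np.clip(pos, 0, None)
hit = (pos >= 0) & (ips <= ends[safe_pos])
users_df['geonames'] = np.where(hit, geonames[safe_pos], None)
```

Parsing the networks is still a Python-level loop, but it runs once; the lookup itself costs one binary search per IP instead of a scan over every network. IPs that fall in no range get `None`.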
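Answer 1: (score: 0)
If you are willing to use R instead of Python, I have written an ipaddress package that can solve this problem. There is still an underlying loop, but it is implemented in C++ (much faster!).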
library(tibble)
library(ipaddress)
library(fuzzyjoin)

addr <- tibble(
  id = 1:3,
  address = ip_address(c("96.255.18.236", "105.49.228.135", "104.236.210.234"))
)
nets <- tibble(
  network = ip_network(c("96.255.18.0/24", "105.49.224.0/19", "104.236.128.0/17")),
  geoname = c("4360369.0", "192950.0", "5391959.0")
)

fuzzy_left_join(addr, nets, c("address" = "network"), is_within)
#> # A tibble: 3 x 4
#>      id         address          network geoname
#>   <int>       <ip_addr>       <ip_netwk> <chr>
#> 1     1   96.255.18.236   96.255.18.0/24 4360369.0
#> 2     2  105.49.228.135  105.49.224.0/19 192950.0
#> 3     3 104.236.210.234 104.236.128.0/17 5391959.0
Created on 2020-09-02 by the reprex package (v0.3.0)
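For readers who want to stay in Python, a rough analogue of the "is within" join can be sketched with pandas by grouping the networks by prefix length: for each distinct prefix length (at most 33 for IPv4), mask all user IPs at once and look the result up among the networks of that length. This is a sketch, not a port of the R package; it assumes IPv4 only, and when networks overlap it keeps the most specific (longest-prefix) match.

```python
import ipaddress
import numpy as np
import pandas as pd

users_df = pd.DataFrame({'id': [1, 2, 3],
                         'ip': ['96.255.18.236', '105.49.228.135', '104.236.210.234']})
geonames_df = pd.DataFrame({'network': ['96.255.18.0/24', '105.49.224.0/19', '104.236.128.0/17'],
                            'geoname': ['4360369.0', '192950.0', '5391959.0']})

# Split each CIDR string into an integer network address and a prefix length.
parsed = [ipaddress.ip_network(n) for n in geonames_df['network']]
nets = pd.DataFrame({
    'net': np.array([int(n.network_address) for n in parsed], dtype=np.int64),
    'prefix': [n.prefixlen for n in parsed],
    'geoname': geonames_df['geoname'].to_numpy(dtype=object),
})

ips = np.array([int(ipaddress.IPv4Address(ip)) for ip in users_df['ip']], dtype=np.int64)
result = np.full(len(ips), None, dtype=object)

# Try the longest prefixes first so the most specific network wins.
for prefix in sorted(nets['prefix'].unique(), reverse=True):
    mask = (0xFFFFFFFF << (32 - int(prefix))) & 0xFFFFFFFF
    same_len = nets.loc[nets['prefix'] == prefix].drop_duplicates('net')
    lookup = same_len.set_index('net')['geoname']
    # Mask every IP to this prefix length and look it up in one vectorized step.
    candidates = pd.Series(ips & mask).map(lookup).to_numpy(dtype=object)
    fill = pd.isna(result) & ~pd.isna(candidates)
    result[fill] = candidates[fill]

users_df['geonames'] = result
```

The loop runs once per distinct prefix length rather than once per network, so its cost depends on the number of users times the number of prefix lengths, plus one indexed lookup per group.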