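I want to sort the entries within each row/cell of a particular column, 'D', in ascending order by IP address. Each entry sits on its own line and has its protocol and port listed after the IP. I don't care about those; I only want to sort on the four octets of the IP address. I suspect this needs some kind of regex with a lambda function. Sometimes there may be a hostname instead of an IP address.
A sample DataFrame would be: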
ID A B C D
1 x x x 10.0.0.50/TCP/50
192.168.1.90/TCP/51
server1/TCP/80
10.0.0.9/TCP/78
2 y y y 192.168.3.90/UDP/53
10.0.4.10/TCP/65
10.0.3.4/TCP/34
host1/UDP/80
3 z z z 10.0.0.40/TCP/80
10.0.0.41/TCP/443
192.168.2.70/UDP/98
10.0.0.9/TCP/12
The desired output is:
ID A B C D
1 x x x 10.0.0.9/TCP/78
10.0.0.50/TCP/50
192.168.1.90/TCP/51
server1/TCP/80
2 y y y 10.0.3.4/TCP/34
10.0.4.10/TCP/65
192.168.3.90/UDP/53
host1/UDP/80
3 z z z 10.0.0.9/TCP/12
10.0.0.40/TCP/80
10.0.0.41/TCP/443
192.168.2.70/UDP/98
To build the DataFrame above, I initially used groupby to combine the D rows, but the IP addresses do not come out in order:
df = df.groupby(['ID','A','B','C'], sort=False, as_index=False)['D'].apply('\n'.join)
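If possible, it might be more efficient to combine and sort in one step rather than with two separate commands?
Any ideas are much appreciated. I have looked at several examples, but none seem to fit. I hope my explanation is clear, and thanks in advance for any help.
Answer 0 (score: 1)
Assuming you have the original DF, before grouping: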
In [70]: df
Out[70]:
ID A B C D
0 1.0 x x x 10.0.0.50/TCP/50
1 1.0 x x x 192.168.1.90/TCP/51
2 1.0 x x x server1/TCP/80
3 1.0 x x x 10.0.0.9/TCP/78
4 2.0 y y y 192.168.3.90/UDP/53
5 2.0 y y y 10.0.4.10/TCP/65
6 2.0 y y y 10.0.3.4/TCP/34
7 2.0 y y y host1/UDP/80
8 3.0 z z z 10.0.0.40/TCP/80
9 3.0 z z z 10.0.0.41/TCP/443
10 3.0 z z z 192.168.2.70/UDP/98
11 3.0 z z z 10.0.0.9/TCP/12
Option 1: multi-index DF:
In [69]: (df.assign(x=df.D.replace(['/.*',r'\b(\d{1})\b',r'\b(\d{2})\b'],
...: ['',r'00\1',r'0\1'],
...: regex=True))
...: .sort_values('x')
...: .groupby(['ID','A','B','C'], sort=False, as_index=False)['D']
...: .apply('\n'.join)
...: .to_frame('D'))
...:
...:
Out[69]:
D
ID A B C
1.0 x x x 10.0.0.9/TCP/78\n10.0.0.50/TCP/50\n192.168.1.9...
3.0 z z z 10.0.0.9/TCP/12\n10.0.0.40/TCP/80\n10.0.0.41/T...
2.0 y y y 10.0.3.4/TCP/34\n10.0.4.10/TCP/65\n192.168.3.9...
Option 2: regular DF:
In [75]: (df.assign(x=df.D.replace(['/.*',r'\b(\d{1})\b',r'\b(\d{2})\b'],
...: ['',r'00\1',r'0\1'],
...: regex=True))
...: .sort_values('x')
...: .groupby(['ID','A','B','C'], sort=False, as_index=False)['D']
...: .apply('\n'.join)
...: .reset_index(name='D'))
...:
...:
Out[75]:
ID A B C D
0 1.0 x x x 10.0.0.9/TCP/78\n10.0.0.50/TCP/50\n192.168.1.9...
1 3.0 z z z 10.0.0.9/TCP/12\n10.0.0.40/TCP/80\n10.0.0.41/T...
2 2.0 y y y 10.0.3.4/TCP/34\n10.0.4.10/TCP/65\n192.168.3.9...
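The question also asks whether combining and sorting can happen in one step. One possible alternative, which is not the method used in this answer, is to sort each group's entries inside the aggregation itself. The sketch below assumes a numeric key built with the standard-library ipaddress module; the helper names ip_sort_key and sort_and_join are hypothetical, and the small frame only reproduces the first two groups of the sample data.

```python
import ipaddress

import pandas as pd


def ip_sort_key(entry):
    """Hypothetical helper: IPv4 addresses sort numerically, hostnames after them."""
    host = entry.split("/", 1)[0]  # drop the '/PROTO/PORT' suffix
    try:
        return (0, int(ipaddress.IPv4Address(host)), "")
    except ValueError:  # not an IPv4 address, treat it as a hostname
        return (1, 0, host)


def sort_and_join(entries):
    """Hypothetical helper: sort one group's entries, then join them with newlines."""
    return "\n".join(sorted(entries, key=ip_sort_key))


df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "A": list("xxxxyyyy"),
    "B": list("xxxxyyyy"),
    "C": list("xxxxyyyy"),
    "D": ["10.0.0.50/TCP/50", "192.168.1.90/TCP/51", "server1/TCP/80", "10.0.0.9/TCP/78",
          "192.168.3.90/UDP/53", "10.0.4.10/TCP/65", "10.0.3.4/TCP/34", "host1/UDP/80"],
})

# Group first, then sort and join inside each group in a single aggregation.
result = (df.groupby(["ID", "A", "B", "C"], sort=False)["D"]
            .agg(sort_and_join)
            .reset_index())
```

Because the sort happens per group, the groups keep their original order (1, 2) instead of the order in which they first appear after a global sort.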
The following may help to understand how it works:
Add a helper column x with the IP octets zero-padded:
In [71]: df.assign(x=df.D.replace(['/.*',r'\b(\d{1})\b',r'\b(\d{2})\b'],
...: ['',r'00\1',r'0\1'],
...: regex=True))
...:
...:
Out[71]:
ID A B C D x
0 1.0 x x x 10.0.0.50/TCP/50 010.000.000.050
1 1.0 x x x 192.168.1.90/TCP/51 192.168.001.090
2 1.0 x x x server1/TCP/80 server1
3 1.0 x x x 10.0.0.9/TCP/78 010.000.000.009
4 2.0 y y y 192.168.3.90/UDP/53 192.168.003.090
5 2.0 y y y 10.0.4.10/TCP/65 010.000.004.010
6 2.0 y y y 10.0.3.4/TCP/34 010.000.003.004
7 2.0 y y y host1/UDP/80 host1
8 3.0 z z z 10.0.0.40/TCP/80 010.000.000.040
9 3.0 z z z 10.0.0.41/TCP/443 010.000.000.041
10 3.0 z z z 192.168.2.70/UDP/98 192.168.002.070
11 3.0 z z z 10.0.0.9/TCP/12 010.000.000.009
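As a rough illustration of what the three regex replacements do, here is a plain-Python sketch that uses only the re module. The name pad_key is hypothetical and not part of the answer. It strips everything from the first '/' onward and left-pads one- and two-digit numbers to three digits, so that plain string comparison agrees with numeric order.

```python
import re


def pad_key(entry):
    """Hypothetical helper mirroring the replace() call on a single string."""
    host = re.sub(r"/.*", "", entry)  # drop '/PROTO/PORT'
    # Pad standalone 1- or 2-digit numbers to width 3 (9 -> 009, 50 -> 050).
    return re.sub(r"\b(\d{1,2})\b", lambda m: m.group(1).zfill(3), host)


entries = ["10.0.0.50/TCP/50", "192.168.1.90/TCP/51", "server1/TCP/80", "10.0.0.9/TCP/78"]
ordered = sorted(entries, key=pad_key)
```

Without the padding, "10.0.0.9" would sort after "10.0.0.50" because the strings are compared character by character. Note that a hostname such as host-1 has a word boundary before the digit, so it would become host-001; server1 and host1 are left unchanged because the digit directly follows a letter.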
Sort the DF by the helper column x:
In [72]: (df.assign(x=df.D.replace(['/.*',r'\b(\d{1})\b',r'\b(\d{2})\b'],
...: ['',r'00\1',r'0\1'],
...: regex=True))
...: .sort_values('x'))
...:
...:
Out[72]:
ID A B C D x
3 1.0 x x x 10.0.0.9/TCP/78 010.000.000.009
11 3.0 z z z 10.0.0.9/TCP/12 010.000.000.009
8 3.0 z z z 10.0.0.40/TCP/80 010.000.000.040
9 3.0 z z z 10.0.0.41/TCP/443 010.000.000.041
0 1.0 x x x 10.0.0.50/TCP/50 010.000.000.050
6 2.0 y y y 10.0.3.4/TCP/34 010.000.003.004
5 2.0 y y y 10.0.4.10/TCP/65 010.000.004.010
1 1.0 x x x 192.168.1.90/TCP/51 192.168.001.090
10 3.0 z z z 192.168.2.70/UDP/98 192.168.002.070
4 2.0 y y y 192.168.3.90/UDP/53 192.168.003.090
7 2.0 y y y host1/UDP/80 host1
2 1.0 x x x server1/TCP/80 server1
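The final groupby step works after a single global sort because groupby keeps rows in their existing order inside each group. A plain-Python sketch of that idea, using the (ID, D) pairs in the order shown in Out[72], with integer IDs for brevity (the names sorted_rows, groups, and joined are illustrative only):

```python
# (ID, D) pairs in the order of Out[72], i.e. already sorted by the helper column x.
sorted_rows = [
    (1, "10.0.0.9/TCP/78"),
    (3, "10.0.0.9/TCP/12"),
    (3, "10.0.0.40/TCP/80"),
    (3, "10.0.0.41/TCP/443"),
    (1, "10.0.0.50/TCP/50"),
    (2, "10.0.3.4/TCP/34"),
    (2, "10.0.4.10/TCP/65"),
    (1, "192.168.1.90/TCP/51"),
    (3, "192.168.2.70/UDP/98"),
    (2, "192.168.3.90/UDP/53"),
    (2, "host1/UDP/80"),
    (1, "server1/TCP/80"),
]

groups = {}  # dicts keep insertion order, similar to groupby(sort=False)
for group_id, entry in sorted_rows:
    groups.setdefault(group_id, []).append(entry)

# Join each group's entries; their relative order comes from the global sort.
joined = {group_id: "\n".join(entries) for group_id, entries in groups.items()}
```

The groups appear in the order 1, 3, 2, matching Out[69] and Out[75], since that is the order in which each ID first occurs in the sorted rows.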