I have a 12 GB CSV file. How can I sum a specific column based on the values of other columns?
import pandas as pd
df = pd.read_csv("/Users/kiya/sentbytetop.csv", dtype={'ip':str}, names=['url','bytes','ip'])
df1 = df.groupby(['ip', 'url'], as_index=False)['bytes'].sum()
df1['bytes'] /= 10**6
df1 = df1.rename(columns={'bytes':'MB'})
print(df1)
My output:
url bytes ip
setup.icl.com:443 "600" "175.11.8.11"
setup.icl.com:443 "272" "172.18.8.26"
dap-int.net:443 "243" "172.16.22.241"
How do I sum the bytes column for each unique IP address and URL? Expected output:
url bytes ip
setup.iclo.com:443 "3633.6 " "175.11.8.11"
setup.iclo.com:443 "3676.6 " "172.18.8.26"
dap-int.net:443 "2647.2" "172.16.22.241"
Answer 0 (score: 3)
First aggregate with sum, then divide to convert the total to MB, and finally rename the column:
# read the ip column as strings
df = pd.read_csv("/Users/kiya/sentbytetop.csv", dtype={'ip':str})
df1 = df.groupby(['ip', 'url'], as_index=False)['bytes'].sum()
df1['bytes'] /= 10**6
df1 = df1.rename(columns={'bytes':'MB'})
print(df1)
ip url MB
0 172.16.22.241 dap-int.net:443 0.000243
1 172.18.8.26 setup.icl.com:443 0.000272
2 175.11.8.11 setup.icl.com:443 0.000600
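Because the question mentions a 12 GB file, loading it in one call may not fit in memory. The sketch below is one possible variation, not part of the original answer: it reads the file in chunks, sums each chunk, and then combines the partial sums. The function name `sum_bytes_in_chunks`, the chunk size, the comma delimiter, and the header-less layout (taken from the `names=` argument in the question) are all assumptions.

```python
import io
import pandas as pd

def sum_bytes_in_chunks(source, chunksize=1_000_000):
    """Sum bytes per (ip, url) without loading the whole file at once.

    Hypothetical helper: assumes a comma-separated, header-less file
    with the columns url, bytes, ip in that order.
    """
    partials = []
    reader = pd.read_csv(
        source,
        names=["url", "bytes", "ip"],
        dtype={"url": str, "ip": str, "bytes": "float64"},
        chunksize=chunksize,
    )
    for chunk in reader:
        # Partial sum for this chunk only; the result is small.
        partials.append(chunk.groupby(["ip", "url"])["bytes"].sum())
    # The same (ip, url) pair can appear in several chunks, so sum again.
    total = pd.concat(partials).groupby(level=["ip", "url"]).sum()
    out = total.reset_index()
    out["bytes"] /= 10**6  # bytes -> MB, as in the answer above
    return out.rename(columns={"bytes": "MB"})

# Small in-memory stand-in for the real file (made-up rows).
SAMPLE = (
    "setup.icl.com:443,600,175.11.8.11\n"
    "setup.icl.com:443,272,172.18.8.26\n"
    "dap-int.net:443,243,172.16.22.241\n"
    "dap-int.net:443,243,172.16.22.241\n"
)
result = sum_bytes_in_chunks(io.StringIO(SAMPLE), chunksize=2)
print(result)
```

For the real file, pass the path instead of the `io.StringIO` object and pick a chunk size that fits comfortably in memory.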
Answer 1 (score: 2)
The main point is that you need to convert the bytes column from object to int or float before you can sum it.
df2 = pd.DataFrame([['setup.icl.com:443', "600", "175.11.8.11" ],
['setup.icl.com:443', "272", "172.18.8.26"],
['dap-int.net:443', "243", "172.16.22.241"],
['dap-int.net:443', "243", "172.16.22.241"],
['dap-int.net:441', "243", "172.16.22.241"],
['dap-int.net:441', "243", "172.16.22.241"]],
columns=['url', 'bytes', 'ip'])
url bytes ip
0 setup.icl.com:443 600 175.11.8.11
1 setup.icl.com:443 272 172.18.8.26
2 dap-int.net:443 243 172.16.22.241
3 dap-int.net:443 243 172.16.22.241
4 dap-int.net:441 243 172.16.22.241
5 dap-int.net:441 243 172.16.22.241
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
url 6 non-null object
bytes 6 non-null object
ip 6 non-null object
dtypes: object(3)
memory usage: 224.0+ bytes
df2['bytes'] = df2['bytes'].astype('float64')
df2.groupby(['ip', 'url'])[['bytes']].sum()
                                 bytes
ip            url
172.16.22.241 dap-int.net:441    486.0
              dap-int.net:443    486.0
172.18.8.26   setup.icl.com:443  272.0
175.11.8.11   setup.icl.com:443  600.0
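If some values in the bytes column might not be valid numbers, `astype('float64')` raises an error. A possible alternative, sketched here with a made-up bad value (`"n/a"`) that is not in the original data, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be counted and are skipped by `sum()`.

```python
import pandas as pd

df2 = pd.DataFrame(
    [["setup.icl.com:443", "600", "175.11.8.11"],
     ["setup.icl.com:443", "272", "172.18.8.26"],
     ["dap-int.net:443", "n/a", "172.16.22.241"],  # hypothetical bad value
     ["dap-int.net:443", "243", "172.16.22.241"]],
    columns=["url", "bytes", "ip"],
)

# Strings that cannot be parsed become NaN instead of raising.
df2["bytes"] = pd.to_numeric(df2["bytes"], errors="coerce")

# Count the rows that failed to parse before aggregating.
bad_rows = int(df2["bytes"].isna().sum())
print("rows with unparseable bytes:", bad_rows)

# sum() skips NaN by default, so the bad row does not affect the totals.
totals = df2.groupby(["ip", "url"], as_index=False)["bytes"].sum()
print(totals)
```

Whether silently skipping bad rows is acceptable depends on the data; checking `bad_rows` first makes the choice explicit.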
Including the conversion from bytes to megabytes (per jezrael):
df2['Mbytes'] = df2['bytes'].astype('float64')/10**6
url bytes ip Mbytes
0 setup.icl.com:443 600 175.11.8.11 0.000600
1 setup.icl.com:443 272 172.18.8.26 0.000272
2 dap-int.net:443 243 172.16.22.241 0.000243
3 dap-int.net:443 243 172.16.22.241 0.000243
4 dap-int.net:441 243 172.16.22.241 0.000243
5 dap-int.net:441 243 172.16.22.241 0.000243
df3 = df2.groupby(['ip', 'url'])[['Mbytes']].sum()
                                   Mbytes
ip            url
172.16.22.241 dap-int.net:441    0.000486
              dap-int.net:443    0.000486
172.18.8.26   setup.icl.com:443  0.000272
175.11.8.11   setup.icl.com:443  0.000600
You may want to use timeit to see which approach is fastest.
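A rough sketch of such a comparison, using the standard-library `timeit` module on synthetic data built by repeating the sample rows. The function names, the repeat count, and the number of timing runs are arbitrary choices for illustration; timings on the real 12 GB file will differ.

```python
import timeit
import pandas as pd

# Synthetic data: the sample rows repeated many times (made-up size).
urls = ["setup.icl.com:443", "setup.icl.com:443", "dap-int.net:443", "dap-int.net:441"]
ips = ["175.11.8.11", "172.18.8.26", "172.16.22.241", "172.16.22.241"]
sizes = [600.0, 272.0, 243.0, 243.0]
repeat = 5000
df = pd.DataFrame({"url": urls * repeat, "bytes": sizes * repeat, "ip": ips * repeat})

def sum_then_convert(frame):
    # Approach from answer 0: aggregate first, then scale the small result.
    out = frame.groupby(["ip", "url"], as_index=False)["bytes"].sum()
    out["bytes"] /= 10**6
    return out.rename(columns={"bytes": "MB"})

def convert_then_sum(frame):
    # Approach from answer 1: scale every row first, then aggregate.
    tmp = frame.assign(MB=frame["bytes"] / 10**6)
    return tmp.groupby(["ip", "url"], as_index=False)["MB"].sum()

timings = {}
for fn in (sum_then_convert, convert_then_sum):
    # Total seconds for 20 calls; lower is faster.
    timings[fn.__name__] = timeit.timeit(lambda: fn(df), number=20)
    print(f"{fn.__name__}: {timings[fn.__name__]:.4f} s for 20 runs")
```

Both functions should produce the same totals up to floating-point rounding, so the comparison is only about speed.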