How to sum the values of a specific column in pandas

Asked: 2018-09-05 04:32:45

Tags: python pandas dataframe

I have a 12 GB CSV file. How can I sum a specific column based on the values of other columns?

import pandas as pd
df = pd.read_csv("/Users/kiya/sentbytetop.csv", dtype={'ip':str}, names=['url','bytes','ip'])
df1 = df.groupby(['ip', 'url'], as_index=False)['bytes'].sum()
df1['bytes'] /= 10**6
df1= df1.rename(columns={'bytes':'MB'})
print(df1)

My output:

url                  bytes      ip
setup.icl.com:443    "600"  "175.11.8.11"         
setup.icl.com:443    "272"  "172.18.8.26"
dap-int.net:443      "243"  "172.16.22.241" 

How can I sum the bytes column for each unique combination of IP address and URL?

url                    bytes       ip
setup.iclo.com:443    "3633.6 "  "175.11.8.11"         
setup.iclo.com:443    "3676.6 "  "172.18.8.26"
dap-int.net:443       "2647.2"  "172.16.22.241"

2 answers:

Answer 0 (score: 3)

First aggregate with sum, then divide by 10**6 to convert bytes to MB, and finally rename the column:

#convert ip column to strings
df = pd.read_csv("/Users/kiya/sentbytetop.csv", dtype={'ip':str})

df1 = df.groupby(['ip', 'url'], as_index=False)['bytes'].sum()
df1['bytes'] /= 10**6
df1= df1.rename(columns={'bytes':'MB'})
print (df1)
              ip                url        MB
0  172.16.22.241    dap-int.net:443  0.000243
1    172.18.8.26  setup.icl.com:443  0.000272
2    175.11.8.11  setup.icl.com:443  0.000600
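
Since the file is 12 GB, loading it in one call may not fit in memory. The following is only a sketch of one possible workaround, assuming the file has no header row and the column order url, bytes, ip as in the question: read the file in chunks, sum each chunk, then sum the partial results. The in-memory csv_data and the tiny chunksize are stand-ins for illustration, not the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large CSV; pass the real file path instead.
csv_data = io.StringIO(
    "setup.icl.com:443,600,175.11.8.11\n"
    "setup.icl.com:443,272,172.18.8.26\n"
    "dap-int.net:443,243,172.16.22.241\n"
    "dap-int.net:443,243,172.16.22.241\n"
)

partials = []
# With chunksize set, read_csv yields DataFrames of at most that many rows.
# chunksize=3 is only for the demo; a real run would use a much larger value.
for chunk in pd.read_csv(csv_data, names=['url', 'bytes', 'ip'],
                         dtype={'ip': str}, chunksize=3):
    partials.append(chunk.groupby(['ip', 'url'])['bytes'].sum())

# The same (ip, url) pair can occur in several chunks, so sum the partial sums.
total = pd.concat(partials).groupby(level=['ip', 'url']).sum()

# Convert bytes to MB and return to a flat table.
result = (total / 10**6).rename('MB').reset_index()
print(result)
```

The second groupby is what makes the chunked result match a single-pass sum: each chunk only sees part of the rows for a given pair.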

Answer 1 (score: 2)

The main point is that you need to convert the bytes column from object to int or float before you can sum it.

df2 = pd.DataFrame([['setup.icl.com:443', "600", "175.11.8.11" ],
                    ['setup.icl.com:443', "272", "172.18.8.26"],
                    ['dap-int.net:443', "243", "172.16.22.241"],
                    ['dap-int.net:443', "243", "172.16.22.241"],
                    ['dap-int.net:441', "243", "172.16.22.241"],
                    ['dap-int.net:441', "243", "172.16.22.241"]],
                   columns=['url', 'bytes', 'ip'])

                 url bytes             ip
0  setup.icl.com:443   600    175.11.8.11
1  setup.icl.com:443   272    172.18.8.26
2    dap-int.net:443   243  172.16.22.241
3    dap-int.net:443   243  172.16.22.241
4    dap-int.net:441   243  172.16.22.241
5    dap-int.net:441   243  172.16.22.241


df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
url      6 non-null object
bytes    6 non-null object
ip       6 non-null object
dtypes: object(3)
memory usage: 224.0+ bytes

df2['bytes'] = df2['bytes'].astype('float64')

df2.groupby(['ip', 'url'])[['bytes']].sum()

                                 bytes
ip            url
172.16.22.241 dap-int.net:441    486.0
              dap-int.net:443    486.0
172.18.8.26   setup.icl.com:443  272.0
175.11.8.11   setup.icl.com:443  600.0
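
If some values in the column are not valid numbers, astype('float64') raises an error. A possible alternative, sketched below, is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so that sum skips them. The 'n/a' entry is an invented example of bad data, not something taken from the question's file.

```python
import pandas as pd

# Small frame with string bytes; 'n/a' is a hypothetical bad value.
df_bad = pd.DataFrame(
    [['setup.icl.com:443', '600', '175.11.8.11'],
     ['setup.icl.com:443', 'n/a', '175.11.8.11'],
     ['dap-int.net:443', '243', '172.16.22.241'],
     ['dap-int.net:443', '243', '172.16.22.241']],
    columns=['url', 'bytes', 'ip'])

# errors='coerce' replaces values that cannot be parsed with NaN.
df_bad['bytes'] = pd.to_numeric(df_bad['bytes'], errors='coerce')

# sum() ignores NaN by default, so the bad row contributes nothing.
summed = df_bad.groupby(['ip', 'url'], as_index=False)['bytes'].sum()
print(summed)
```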

Including the conversion from bytes to megabytes (as per jezrael):

df2['Mbytes'] = df2['bytes'].astype('float64')/10**6

                 url bytes             ip    Mbytes
0  setup.icl.com:443   600    175.11.8.11  0.000600
1  setup.icl.com:443   272    172.18.8.26  0.000272
2    dap-int.net:443   243  172.16.22.241  0.000243
3    dap-int.net:443   243  172.16.22.241  0.000243
4    dap-int.net:441   243  172.16.22.241  0.000243
5    dap-int.net:441   243  172.16.22.241  0.000243

df3 = df2.groupby(['ip', 'url'])[['Mbytes']].sum()


                                   Mbytes
ip            url
172.16.22.241 dap-int.net:441    0.000486
              dap-int.net:443    0.000486
172.18.8.26   setup.icl.com:443  0.000272
175.11.8.11   setup.icl.com:443  0.000600

You may want to try timeit to see which approach is fastest.
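
A rough sketch of such a comparison, using the standard-library timeit module on a small made-up frame. The function names and the number=20 repeat count are arbitrary choices for illustration; timings on the real 12 GB file will look different, so treat this only as a template.

```python
import timeit
import pandas as pd

# Small numeric sample; real timings should use a realistic amount of data.
df = pd.DataFrame({
    'url': ['setup.icl.com:443', 'setup.icl.com:443',
            'dap-int.net:443', 'dap-int.net:443'],
    'bytes': [600, 272, 243, 243],
    'ip': ['175.11.8.11', '172.18.8.26', '172.16.22.241', '172.16.22.241'],
})

def sum_then_convert():
    # Aggregate first, then convert the (smaller) result to MB.
    out = df.groupby(['ip', 'url'], as_index=False)['bytes'].sum()
    out['bytes'] = out['bytes'] / 10**6
    return out.rename(columns={'bytes': 'MB'})

def convert_then_sum():
    # Convert every row to MB first, then aggregate.
    tmp = df.assign(MB=df['bytes'] / 10**6)
    return tmp.groupby(['ip', 'url'], as_index=False)['MB'].sum()

# timeit.timeit accepts a callable and returns total seconds for `number` runs.
t_sum_first = timeit.timeit(sum_then_convert, number=20)
t_convert_first = timeit.timeit(convert_then_sum, number=20)
print(f"sum then convert: {t_sum_first:.4f} s")
print(f"convert then sum: {t_convert_first:.4f} s")
```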