Pandas groupby (runtime 15+ minutes)

Time: 2018-04-26 08:01:48

Tags: python performance pandas pandas-groupby dask

I am trying to analyze a network traffic dataset with more than 1,000,000 packets, and I have the following code:

pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv')
pcap_data.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len']
pcap_data['info'] = "null"
pcap_data.parse_dates=["time"]

pcap_data['num'] = 1
df = pcap_data
df    
%%time
df['time'] = pd.to_datetime(df['time'])
df.index = df['time']
data = df.copy()

data_group = pd.DataFrame({'count': data.groupby(['ipdst', 'proto', data.index]).size()}).reset_index()
pd.options.display.float_format = '{:,.0f}'.format
data_group.index = data_group['time']
data_group

data_group2 = data_group.groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index().dropna()
data_group2

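The first part of the script, which imports the .csv, runs in 5 seconds, but when pandas groups by IP + PROTO and resamples at 5 s, the runtime is 15 minutes. Does anyone know how I can get better performance?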

EDIT

Now I am trying to use dask, and I have the following code:

Import the .csv

filename = '/home/alexfrancow/AAA/data1.csv'
df = dd.read_csv(filename)
df.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'info']
df.parse_dates=["time"]
df['num'] = 1
%time df.head(2)

Out

Group ipdst + proto by 5S freq

df.set_index('time').groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index()

How can I group by IP + PROTO at a 5S frequency?

2 answers:

Answer 0: (score: 0)

I tried to simplify your code a bit, but for a large DataFrame the performance should only be slightly better:

import pandas as pd

pd.options.display.float_format = '{:,.0f}'.format

#name the columns while reading and convert the time column to a DatetimeIndex
#(header=0 assumes the file has a header row, which the names list replaces)
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv',
                        header=0,
                        names=['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len'],
                        parse_dates=['time'],
                        index_col='time')
pcap_data['info'] = "null"
pcap_data['num'] = 1

#remove DataFrame constructor
data_group = pcap_data.groupby(['ipdst', 'proto', 'time']).size().reset_index(name='count')

#resample each (ipdst, proto) group into 5-second bins and sum only the counts
data_group2 = (data_group.set_index('time')
                         .groupby(['ipdst','proto'])
                         .resample('5S')['count']
                         .sum()
                         .reset_index()
                         .dropna())
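
If the goal is only the packet count per (ipdst, proto) in each 5-second window, one alternative worth trying is to skip resample entirely: floor every timestamp to its 5-second bucket and do a single groupby. This is a different technique from the code above, not a tuned version of it, and it is only a sketch. The helper name count_per_bucket and the sample rows are made up for illustration. Unlike resample, it returns no rows for empty windows.

```python
import pandas as pd


def count_per_bucket(df, freq="5s"):
    """Count rows per (ipdst, proto, time bucket) with one groupby.

    Hypothetical helper: assumes df has a datetime64 'time' column
    plus 'ipdst' and 'proto' columns, as in the question.
    """
    # Round each timestamp down to the start of its bucket.
    bucket = df["time"].dt.floor(freq)
    return (df.groupby(["ipdst", "proto", bucket])
              .size()
              .reset_index(name="count"))


# Tiny made-up sample standing in for the real capture.
sample = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-04-26 08:00:00.5",
        "2018-04-26 08:00:03.0",
        "2018-04-26 08:00:07.0",
        "2018-04-26 08:00:01.0",
    ]),
    "ipdst": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "proto": ["TCP", "TCP", "TCP", "UDP"],
})

result = count_per_bucket(sample)
print(result)
```

The per-group resample has to build a separate time grid for every (ipdst, proto) pair, while the floored groupby is a single pass over the rows, so it is usually worth timing both on the real data before choosing.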

Answer 1: (score: -1)

A quick way:

<Self as Actor>::Context

Hope it works for you