Question

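I have an HTTP request log. Its fields are: capture_time, ip, method, url, content, user_agent.

All of this information is in a csv file.

I want to group all requests coming from the same IP within 10-minute intervals.

How can I do this with pandas?

Sample dataset: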

date ip method url content agent

2019-04-24 23:16:48.742466
187.20.211.99
POST
/delivery/check-location
bairro=Vila&cidade=Lima
Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148

I have already tried using the groupby method.

I want to merge the content of all requests into a single row (for requests grouped by ip and time).

Answer 1

df.set_index('date', inplace = True)

unnesting(df.resample('10min')['ip'].unique().reset_index(), ['ip']).reset_index(drop = True)

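First, you need to set the date as the index. Next, resample the time in 10-minute increments, look at the ip column, and get the unique IPs for each time period. Then you need to unnest the lists created by unique() using the function below.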

##https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/55839330#55839330

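import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element of the list stored in that row
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every listed column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the remaining columns to the flattened rows
    return df1.join(df.drop(columns=explode), how='left')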
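
On pandas 0.25 or newer, Series.explode can stand in for the unnesting helper. The following is only a sketch under assumptions: the sample rows are made up, and it groups on a floored timestamp column (the hypothetical name `bin`) instead of resampling the index, so it is a swapped-in technique rather than the exact call chain above.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question's header.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-24 23:16:48", "2019-04-24 23:18:02",
        "2019-04-24 23:19:30", "2019-04-24 23:31:10",
    ]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.7", "187.20.211.99"],
})

# Start of the 10-minute interval each request falls into ("bin" is an assumed name).
df["bin"] = df["date"].dt.floor("10min")

# unique() yields one array of IPs per interval; explode() turns it into one row per IP.
unique_ips = df.groupby("bin")["ip"].unique().explode().reset_index()
print(unique_ips)
```

Each row of `unique_ips` is one (interval, ip) pair, which is the same shape the unnesting call is meant to produce.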

After this, you can concatenate the content as you planned.

Edit:

# Set index to the date column
df.set_index('date', inplace = True)

# 10 minutes in nanoseconds
ns10min = 10 * 60 * 1000000000

# Floor each timestamp to the start of its 10-minute interval
df.index = pd.to_datetime((df.index.astype(np.int64) // ns10min) * ns10min)

# Group by both the floored index and ip, then take the first row of each group
df.groupby([df.index, df['ip']]).first()
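
Since the question asks for the content of all requests in a group to be merged into one row, a variation on the snippet above might join the strings instead of taking first(). This is a rough sketch with made-up rows; it uses dt.floor('10min') in place of the nanosecond arithmetic, and the column name `bin` is an assumption.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question's header.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-24 23:16:48", "2019-04-24 23:18:02",
        "2019-04-24 23:19:30", "2019-04-24 23:31:10",
    ]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.7", "187.20.211.99"],
    "content": ["bairro=Vila&cidade=Lima", "a=1", "b=2", "c=3"],
})

# Floor each timestamp to the start of its 10-minute interval.
df["bin"] = df["date"].dt.floor("10min")

# One row per (interval, ip), with the request contents joined by a space.
merged = df.groupby(["bin", "ip"])["content"].agg(" ".join).reset_index()
print(merged)
```

The join assumes every content value is a string; missing values would need a fillna('') first.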

Answer 2

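I used Ben Pap's approach to group the IPs by date. That gave me a dataframe containing the IPs and the time intervals. To join the other columns and add them to this dataframe, I did the following: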

content = []
for index, row in test.iterrows():
    texto = ""
    # Rows of df2 with the same ip that fall inside this row's 10-minute window
    start = row.iloc[0]
    resul = df2.loc[(df2[df2.columns[1]] == row.iloc[2])
                    & (start < df2.index)
                    & (df2.index < start + pd.Timedelta(minutes=10))]
    # Append the third, fourth and fifth columns of every matching request
    for _, current_row in resul.iterrows():
        texto += " " + current_row.values[2] + " " + current_row.values[3] + " " + current_row.values[4]
    content.append(texto)
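
The nested loops can probably be avoided by building the text per request first and letting groupby do the joining. A minimal sketch, assuming the columns are named as in the question's header and that method, url and content are all strings; the sample rows and URLs are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question's header.
df2 = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-24 23:16:48", "2019-04-24 23:18:02",
        "2019-04-24 23:19:30", "2019-04-24 23:31:10",
    ]),
    "ip": ["187.20.211.99", "187.20.211.99", "10.0.0.7", "187.20.211.99"],
    "method": ["POST", "GET", "GET", "POST"],
    "url": ["/a", "/b", "/c", "/d"],
    "content": ["x=1", "x=2", "y=1", "x=3"],
})

# One text fragment per request, in the same " method url content" layout as the loop.
df2["texto"] = " " + df2["method"] + " " + df2["url"] + " " + df2["content"]

# Group by 10-minute interval and ip, then concatenate the fragments of each group.
joined = (
    df2.groupby([pd.Grouper(key="date", freq="10min"), "ip"])["texto"]
       .agg("".join)
       .reset_index()
)
print(joined)
```

In `joined`, the date column holds the start of each 10-minute interval, so the result already has one row per IP and interval without a separate lookup step.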

How to group HTTP request logs using pandas
