Creating new columns with Pandas df.apply

Time: 2020-10-07 11:02:53

Tags: python pandas dataframe netflow

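I have a huge NetFlow database (it contains timestamps, source IP, destination IP, protocol, source and destination port numbers, packets exchanged, bytes, and so on). I want to create custom attributes based on the current row and the rows before it.

I want to compute new columns from the current row's source IP and timestamp. This is the logic I have in mind:

  • Get the source IP of the current row.
  • Get the timestamp of the current row.
  • Based on that source IP and timestamp, get all earlier rows in the whole DataFrame that have the same source IP and whose communication happened within the last half hour. This part is very important.
  • For the rows (flows, in my example) that meet both criteria (same source IP, within the last half hour), compute the sum and the mean of all packets and of all bytes.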

[Image: one row from the dataset]

Relevant code snippet:

import pandas as pd

df = pd.read_csv(path, header=None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])

df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(ts, sa, size):
    global joined
    for (x, y) in zip(df['sa'], df['ts']):
        ...
    return sum

df['prev30ipsumpkt'] = df.apply(lambda x: prev_30_ip_sum(x['ts'], x['sa'], x['pkt']), axis=1)

I know there are probably better and more efficient ways to do this, but sadly I am not the best programmer.

Thanks.

2 Answers:

Answer 0 (score: 2)

The explanation is inline in the comments:

from datetime import timedelta

def fun(df, i):
  # Current timestamp
  current = df.loc[i, 'ts']
  # timestamp of last 30 minutes
  last = current - timedelta(minutes=30)
  # Current IP
  ip = df.loc[i, 'sa']
  
  # Rows matching the criteria: same source IP, within the previous 30 minutes
  adf = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == ip)]

  # Return sum and mean
  return adf['pkt'].sum(), adf['pkt'].mean()

# Apply the fun over each row
result = [fun(df, i) for i in df.index]

# Create new columns
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]
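
The list comprehension above filters the whole DataFrame once per row, which may become slow on a large NetFlow capture. As a possible alternative, here is a minimal sketch using a time-based rolling window per source IP. It is not part of the original answer: the helper name `window_stats` and the sample flows are made up for illustration, and it assumes `ts` is already a datetime column. Note that an empty window yields NaN for the rolling sum, so the sketch fills it with 0 to mimic the behaviour above, and that rows sharing the exact same timestamp may be treated differently from the boolean filter, so check the result on your own data.

```python
import pandas as pd

# Made-up flows for illustration only; column names follow the question.
df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2020-01-01 10:00', '2020-01-01 10:10', '2020-01-01 10:20',
        '2020-01-01 10:25', '2020-01-01 11:30',
    ]),
    'sa': ['10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.1', '10.0.0.1'],
    'pkt': [10, 5, 20, 30, 40],
})

def window_stats(group):
    """Hypothetical helper: rolling 30-minute stats for one source IP."""
    # A time-offset rolling window needs a sorted DatetimeIndex.
    pkt = group.set_index('ts')['pkt']
    # closed='left' keeps [t - 30min, t), so the current row is excluded.
    window = pkt.rolling('30min', closed='left')
    return pd.DataFrame(
        {
            'sum': window.sum().fillna(0).to_numpy(),
            'mean': window.mean().to_numpy(),
        },
        index=group.index,  # keep the original row labels for the join
    )

df = df.sort_values('ts')
stats = pd.concat([window_stats(g) for _, g in df.groupby('sa')])
df = df.join(stats)  # aligns on the original index
print(df)
```

The same helper could be extended to the `byt` column in the same way if byte totals are needed too.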

Answer 1 (score: 1)

from datetime import timedelta

import pandas as pd

df = pd.read_csv(path, header=None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])

df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(df, i):
  #current time from current row
  current = df.loc[i, 'ts']
  # timestamp of last 30 minutes 
  last = current - timedelta(minutes=30)

  # Current source address
  sa = df.loc[i, 'sa']

  # Rows from the previous 30 minutes with the same source IP as the current row
  new_df = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == sa)]

  # Return sum and mean
  return new_df['pkt'].sum(), new_df['pkt'].mean()


# Compute (sum, mean) for each row from its source address and timestamp
result = [prev_30_ip_sum(df, i) for i in df.index]

# Create new columns in the current DataFrame
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]
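
To see what these columns end up holding, here is a small worked run of the same function. The five flows are made up for illustration (the original post does not include sample data); only the column names `ts`, `sa` and `pkt` come from the question.

```python
from datetime import timedelta

import pandas as pd

# Made-up flows for illustration only.
df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2020-01-01 10:00', '2020-01-01 10:10', '2020-01-01 10:20',
        '2020-01-01 10:25', '2020-01-01 11:30',
    ]),
    'sa': ['10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.1', '10.0.0.1'],
    'pkt': [10, 5, 20, 30, 40],
})

def prev_30_ip_sum(df, i):
    current = df.loc[i, 'ts']
    last = current - timedelta(minutes=30)
    sa = df.loc[i, 'sa']
    # Same source IP, timestamp in [current - 30min, current)
    new_df = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == sa)]
    return new_df['pkt'].sum(), new_df['pkt'].mean()

result = [prev_30_ip_sum(df, i) for i in df.index]
df['sum'] = [r[0] for r in result]
df['mean'] = [r[1] for r in result]
print(df)
```

For the row at 10:25, the earlier flows from 10.0.0.1 at 10:00 and 10:20 both fall inside the window, so the sum is 30 and the mean is 15. Rows with no earlier matching flow get a sum of 0 and a mean of NaN.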

See the Python documentation for datetime.timedelta to understand how timedelta works.
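
As a quick illustration of the subtraction used in both answers, here is a minimal sketch with an arbitrary timestamp chosen only for the example:

```python
from datetime import timedelta

import pandas as pd

# Arbitrary timestamp, for illustration only.
current = pd.Timestamp('2020-10-07 11:02:53')

# Subtracting a timedelta from a pandas Timestamp gives another Timestamp.
last = current - timedelta(minutes=30)
print(last)  # 2020-10-07 10:32:53

# pandas' own pd.Timedelta gives the same result here.
same = current - pd.Timedelta(minutes=30)
print(last == same)  # True
```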