Extract values from a specific column and process other column values in pandas

Time: 2016-07-28 13:11:14

Tags: python csv pandas

Below is a simplified blob of the dataframe I want to process.

first.csv

No.,Time,Source,Destination,Protocol,Length,Info,src_dst_pair
325778,112.305107,02:e0,Broadcast,ARP,64,Who has 253.244.230.77?  Tell 253.244.230.67,"('02:e0', 'Broadcast')"
801130,261.868118,02:e0,Broadcast,ARP,64,Who has 253.244.230.156?  Tell 253.244.230.67,"('02:e0', 'Broadcast')"
700094,222.055094,02:e0,Broadcast,ARP,60,Who has 253.244.230.77?  Tell 253.244.230.156,"('02:e0', 'Broadcast')"
766543,247.796156,100.118.138.150,41.177.26.176,TCP,66,32222 > http [SYN] Seq=0,"('100.118.138.150', '41.177.26.176')"
767405,248.073313,100.118.138.150,41.177.26.176,TCP,64,32222 > http [ACK] Seq=1,"('100.118.138.150', '41.177.26.176')"
767466,248.083268,100.118.138.150,41.177.26.176,HTTP,380,Continuation [Packet capture],"('100.118.138.150', '41.177.26.176')"
891394,294.989813,105.144.38.121,41.177.26.15,TCP,66,48852 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.121', '41.177.26.15')"
892285,295.320654,105.144.38.121,41.177.26.15,TCP,64,48852 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.121', '41.177.26.15')"
892287,295.321003,105.144.38.121,41.177.26.15,HTTP,350,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.121', '41.177.26.15')"
893306,295.652079,105.144.38.121,41.177.26.15,TCP,64,48852 > http [ACK] Seq=293 Ack=609 Win=64928 Len=0,"('105.144.38.121', '41.177.26.15')"
893307,295.652233,105.144.38.121,41.177.26.15,TCP,64,"48852 > http [FIN, ACK] Seq=293 Ack=609 Win=64928 Len=0","('105.144.38.121', '41.177.26.15')"
885501,294.070377,105.144.38.139,41.177.26.15,TCP,66,48810 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.139', '41.177.26.15')"
887786,294.402349,105.144.38.139,41.177.26.15,TCP,64,48810 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.139', '41.177.26.15')"
887788,294.402642,105.144.38.139,41.177.26.15,HTTP,371,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.139', '41.177.26.15')"
890133,294.732297,105.144.38.139,41.177.26.15,TCP,64,"48810 > http [FIN, ACK] Seq=314 Ack=629 Win=64907 Len=0","('105.144.38.139', '41.177.26.15')"
890154,294.733413,105.144.38.139,41.177.26.15,TCP,64,48810 > http [ACK] Seq=315 Ack=630 Win=64907 Len=0,"('105.144.38.139', '41.177.26.15')"
902758,297.792645,105.144.38.164,41.177.26.15,TCP,66,49005 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.164', '41.177.26.15')"
903926,298.123157,105.144.38.164,41.177.26.15,TCP,64,49005 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.164', '41.177.26.15')"
903932,298.123369,105.144.38.164,41.177.26.15,HTTP,350,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.164', '41.177.26.15')"
905269,298.455368,105.144.38.164,41.177.26.15,TCP,64,49005 > http [ACK] Seq=293 Ack=609 Win=64928 Len=0,"('105.144.38.164', '41.177.26.15')"
905273,298.455557,105.144.38.164,41.177.26.15,TCP,64,"49005 > http [FIN, ACK] Seq=293 Ack=609 Win=64928 Len=0","('105.144.38.164', '41.177.26.15')"
906162,298.714281,105.144.38.204,41.177.26.15,TCP,66,49050 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1,"('105.144.38.204', '41.177.26.15')"
907292,299.025951,105.144.38.204,41.177.26.15,TCP,64,49050 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0,"('105.144.38.204', '41.177.26.15')"
907294,299.026985,105.144.38.204,41.177.26.15,HTTP,354,Continuation or non-HTTP traffic[Packet size limited during capture],"('105.144.38.204', '41.177.26.15')"
907811,299.362918,105.144.38.204,41.177.26.15,TCP,64,49050 > http [ACK] Seq=297 Ack=613 Win=64924 Len=0,"('105.144.38.204', '41.177.26.15')"
907812,299.362951,105.144.38.204,41.177.26.15,TCP,64,"49050 > http [FIN, ACK] Seq=297 Ack=613 Win=64924 Len=0","('105.144.38.204', '41.177.26.15')"

How can I do the following in pandas? For each unique df.src_dst_pair (the last element in each row):

  1. Check whether df.Info contains [SYN]. If not, skip the row (a sketch of detecting these markers is shown right after this list).

  2. If df.Info contains [SYN], store df.Time (this marks the start time).

  3. Starting from the [SYN] row, accumulate df.Length until [FIN, ACK] is found.

  4. Once [FIN, ACK] is found in df.Info, store df.Time (this marks the stop time). If no [FIN, ACK] is ever found in df.Info for a df.src_dst_pair, skip that df.src_dst_pair.

  5. Finally, summarize the results as:

  6. df.src_dst_pair: flow number, (accumulated) df.Length, df.Time(stop)-df.Time(start)
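
For reference, a minimal sketch (variable names are illustrative, assuming first.csv has the columns shown above) of how the [SYN] / [FIN, ACK] markers in df.Info could be flagged with a regular-expression match:

    import pandas as pd

    df = pd.read_csv('first.csv')
    # Escape the square brackets: str.contains treats the pattern as a regex.
    has_syn = df['Info'].str.contains(r'\[SYN\]')
    has_finack = df['Info'].str.contains(r'\[FIN, ACK\]')
    # Show only the rows that open or close a flow.
    print(df.loc[has_syn | has_finack, ['src_dst_pair', 'Time', 'Length', 'Info']])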
    

Expected output for first.csv:

    ('105.144.38.121', '41.177.26.15') : flow 1, 1118, 0.66242
    ('105.144.38.139', '41.177.26.15') : flow 1, 565,  0.028527
    ('105.144.38.139', '41.177.26.15') : flow 2, 608,  0.662912
    ('105.144.38.204', '41.177.26.15') : flow 1, 612,  0.64867
    

My approach:

    import pandas
    import numpy


    data = pandas.read_csv('first.csv')
    print(data)

    uniq_src_dst_pair = numpy.unique(data.src_dst_pair.ravel())
    print(uniq_src_dst_pair)
    print(len(uniq_src_dst_pair))

    # For now I can only aggregate by src_dst_pair; I still need the per-flow info.
    result = data.groupby('src_dst_pair').Length.sum()
    print(result)
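
One possible next step (a rough sketch, not a full solution: it numbers rows from each [SYN] onward but does not yet cut a flow off at [FIN, ACK]; the 'flow' column name is arbitrary) is to derive a per-pair flow counter by cumulatively counting SYN packets within each src_dst_pair:

    # Rough sketch: label each row with a per-pair flow number by counting
    # [SYN] packets cumulatively within its src_dst_pair group.
    # Rows before the first SYN of a pair end up in flow 0, and the count
    # does not stop at [FIN, ACK], so further trimming is still needed.
    is_syn = data['Info'].str.contains(r'\[SYN\]').astype(int)
    data['flow'] = is_syn.groupby(data['src_dst_pair']).cumsum()
    print(data.groupby(['src_dst_pair', 'flow']).Length.sum())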
    

1 Answer:

Answer 0 (score: 1)

import pandas as pd


def extract_flows(g):
    # Find the location of SYN packets
    is_syn = g['Info'].fillna('').str.contains(r'\[SYN\]')
    syn = g[is_syn].index.values

    # Find the location of the FIN-ACK packets
    is_finack = g['Info'].fillna('').str.contains(r'\[FIN, ACK\]')
    finack = g[is_finack].index.values

    # Loop over SYN packets
    runs = []
    for num, start in enumerate(syn, start=1):
        try:
            # Find the first FIN-ACK packet after each SYN packet
            #     If none, raises IndexError
            stop = finack[finack > start][0]
            runs.append([# The flow number counter
                         num,
                         # The time difference between the packets
                         g.loc[stop, 'Time'] - g.loc[start, 'Time'],
                         # The accumulated length
                         g.loc[start:stop, 'Length'].sum()])
        except IndexError:
            break

    # The output must be a DataFrame
    output = (pd.DataFrame(runs, columns=['Flow number', 'Time', 'Length'])
                .set_index('Flow number'))
    return output


df = pd.read_csv('first.csv', usecols=['src_dst_pair', 'Info', 'Time', 'Length'])

result = df.groupby('src_dst_pair').apply(extract_flows)
print(result)

Output:

                                                    Time  Length
src_dst_pair                       Flow number                  
('105.144.38.121', '41.177.26.15') 1            0.662420   608.0
('105.144.38.139', '41.177.26.15') 1            0.661920   565.0
                                   2            0.662912   608.0
('105.144.38.204', '41.177.26.15') 1            0.648670   612.0
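
To print this in the format the OP listed, the grouped result can be iterated over (a small follow-up sketch):

for (pair, flow), row in result.iterrows():
    print('{} : flow {}, {}, {}'.format(pair, flow, int(row['Length']), row['Time']))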

N.B.: The sample data in the OP is not consistent with the sample data in the linked first.csv. Some of the numbers in the output above agree with the output the OP expects for first.csv, but others differ; I believe mine are correct.