Question

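I have a directory full of very large CSV files that were converted from pcap.

I am trying to iterate over every CSV file in that directory and get the most common source IP address (column 2).

Right now my output is wrong, because I seem to have managed to make each file dump its values into the next one before that one starts. Every file appears to have the same IPs, which I know is not the case.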

ipCounter = collections.Counter()

#iterate through all of the files in the directory, using glob
for filename in glob.glob('/path/to/directory/*'):
    with open(filename) as input_file:
        #skip column titles
        input_file.next()

        for row in csv.reader(input_file, delimiter=','):
            ipCounter[row[2]] += 1

    print 'Source IPs most common in: %s' % filename
    print ipCounter.most_common()

I'm not a Python pro, so there may be a better way to do this, but this is what I have so far.

Answer 1

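Your approach looks fine. If you want most_common() per file, though, you need to move the counter inside the for loop so that it starts empty for each file. Alternatively, keep two counters: one that gives you the counts for a single file, and a second that gives you the overall counts for the whole folder: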

import collections
import csv
import glob

ip_counter_all = collections.Counter()    

for filename in glob.glob('ip*.csv'):
    ip_counter = collections.Counter()

    with open(filename) as input_file:
        csv_input = csv.reader(input_file)
        header = next(csv_input)

        for row in csv_input:
            ip_counter[row[2]] += 1

    ip_counter_all.update(ip_counter)

    print '\nSource IPs most common in: {}'.format(filename)

    for ip_addr, count in ip_counter.most_common():
        print "  {}  {}".format(ip_addr, count)

print '\nOverall IPs most common:'

for ip_addr, count in ip_counter_all.most_common():
    print "  {}  {}".format(ip_addr, count)

This gives you output such as:

Source IPs most common in: ips.csv
  1.1.1.1  2
  1.2.3.4  1
  1.4.2.3  1

Source IPs most common in: ips2.csv
  1.1.1.1  2
  1.2.3.4  1
  1.4.2.3  1

Overall IPs most common:
  1.1.1.1  4
  1.2.3.4  2
  1.4.2.3  2

You can also use the newer format() method to display the strings, as the code here does.
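
For anyone on Python 3, the same two-counter idea might look roughly like the sketch below. The function name count_source_ips and its column parameter are made up for illustration and are not part of the original code. The default column=2 mirrors the row[2] used in the question; note that row[2] is the third column counting from zero, so adjust it if the source IP really sits in the second column.

```python
import collections
import csv
import glob


def count_source_ips(pattern, column=2):
    """Return ({filename: Counter}, overall Counter) for files matching pattern.

    Hypothetical helper: `column` is the zero-based index of the source IP.
    """
    per_file = {}
    overall = collections.Counter()

    for filename in sorted(glob.glob(pattern)):
        # A fresh counter per file, so counts never carry over to the next file.
        counter = collections.Counter()

        with open(filename, newline='') as input_file:
            reader = csv.reader(input_file)
            next(reader, None)  # skip the header row; tolerate empty files

            # Rows are read one at a time, so large files are not loaded whole.
            for row in reader:
                if len(row) > column:  # ignore blank or short rows
                    counter[row[column]] += 1

        per_file[filename] = counter
        overall.update(counter)  # add this file's counts to the folder total

    return per_file, overall
```

Calling count_source_ips('/path/to/directory/*.csv') returns both results, and counter.most_common(1) on any of the counters gives just the single most frequent IP rather than the full list.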

Get the most common IP from a directory of pcap-to-CSV files
