Question

I have a file with lines like the following:

date:ip num#1 num#2   

2013.09:142.134.35.17 10 12
2013.09:142.134.35.17 4 4
2013.09:63.151.172.31 52 13
2013.09:63.151.172.31 10 10
2013.09:63.151.172.31 16 32
2013.10:62.151.172.31 16 32

How can I sum the last two elements of the lines that share the same IP, so that I get this result?

2013.09:142.134.35.17 14 16
2013.09:63.151.172.31 78 55
2013.10:62.151.172.31 16 32

Answer 1

Try this:

from collections import Counter
with open('full_megalog.txt') as f:
    data = [d.split() for d in f]

sum1, sum2 = Counter(), Counter()

for d in data:
    sum1[d[0]] += int(d[1])
    sum2[d[0]] += int(d[2])

for date_ip in sum1.keys():
    print(date_ip, sum1[date_ip], sum2[date_ip])

Answer 2

You can do it like this:

addrs='''\
2013.09:142.134.35.17 10 12
2013.09:142.134.35.17 4 4
2013.09:63.151.172.31 52 13
2013.09:63.151.172.31 10 10
2013.09:63.151.172.31 16 32
2013.10:62.151.172.31 16 32'''

class Dicto(dict):
    def __missing__(self, key):
        self[key]=[0,0]
        return self[key]

r=Dicto()
for line in addrs.splitlines():
    ip,n1,n2=line.split()
    r[ip][0]+=int(n1)
    r[ip][1]+=int(n2)

print(r)
# {'2013.09:142.134.35.17': [14, 16],
#  '2013.09:63.151.172.31': [78, 55],
#  '2013.10:62.151.172.31': [16, 32]}

Or, if you prefer, you can use defaultdict:

from collections import defaultdict
r=defaultdict(lambda: [0,0])
for line in addrs.splitlines():
    ip,n1,n2=line.split()
    r[ip][0]+=int(n1)
    r[ip][1]+=int(n2)

print(r)

Answer 3

Editing @piokuc's answer, since the asker specifically asked about the IP, not date + IP. The splitting and summing are done on the IP only.

from collections import Counter
import re
data=\
"""2012.09:142.134.35.17 10 12
2013.09:142.134.35.17 4 4
2013.09:63.151.172.31 52 13
2013.09:63.151.172.31 10 10
2013.09:63.151.172.31 16 32
2013.10:62.151.172.31 16 32"""


data = [re.split('[: ]',d) for d in data.split('\n')]
print(data)
sum1 = Counter()
sum2 = Counter()
for d in data:
    sum1[d[1]] += int(d[2])
    sum2[d[1]] += int(d[3])

for ip in sum1.keys():
    print(ip, sum1[ip], sum2[ip])

Answer 4

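@piokuc's answer is very good; here is a naive solution that should be easy for beginners to understand, without having to reach into the standard library for Counter.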

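The result you are looking for is a set of two (ordered) values, each associated with a unique label (the date:ip value). In Python, the basic data structure for this kind of task is a dict (dictionary).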

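When you open a file, it is good practice to make sure it is closed once you no longer need it. I will use the with statement; if you are interested in more detail about how it works, this is a good resource, but if that is over your head, just remember that once the with block ends, the file you are working with will be closed automatically.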
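To make the point about automatic closing concrete, here is a minimal sketch in Python 3 syntax. The scratch file name example_log.txt is hypothetical and exists only so the snippet can run on its own; the closed attribute of a file object reports whether the file has been closed.

```python
import os
import tempfile

# Hypothetical scratch file, created only so this sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example_log.txt')

with open(path, 'w') as out:
    out.write('2013.09:142.134.35.17 10 12\n')
    inside = out.closed      # still open while we are inside the block

after = out.closed           # leaving the block closed the file for us

with open(path) as f:        # reading works the same way
    lines = f.readlines()

print(inside, after, lines)
```

No explicit call to close() is needed in either block, even if an exception is raised partway through.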

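Here is the code. Keep in mind that everything you read from a file comes in as characters (strings), which means you have to convert the numbers appropriately before doing any kind of math on them: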

result = {}                                        # Create your empty dict

with open('full_megalog.txt', 'r') as file:        # Open your input file

    for line in file:                              # In each line of the file:

        date_ip, num1, num2 = line.split()         # 1:  Get key and 2 values

        if date_ip in result:                      # 2:  Check if key exists

            result[date_ip][0] += int(num1)        # 3a: If yes, add num1, num2
            result[date_ip][1] += int(num2)        #     to current sum.

        else:                                      # 3b: If no, add the new key
            result[date_ip] = [int(num1), int(num2)]  # and values (as a list,
                                                   #     so 3a can update it)

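You now have a result dictionary that associates the sums of num1 and num2 with each corresponding date_ip. You can access the (num1, num2) pair with result[date_ip], and you can access the individual values with result[date_ip][0] and result[date_ip][1] respectively.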
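As a small illustration of those lookups, the sketch below uses a hand-built stand-in for result, filled with the totals from the question rather than read from a file (Python 3 syntax).

```python
# Hand-built stand-in for the result dict, using the totals from the question.
result = {
    '2013.09:142.134.35.17': [14, 16],
    '2013.09:63.151.172.31': [78, 55],
    '2013.10:62.151.172.31': [16, 32],
}

key = '2013.09:63.151.172.31'

pair = result[key]             # both sums at once
num1_total = result[key][0]    # sum of the first number column
num2_total = result[key][1]    # sum of the second number column

# Unpacking gives the two sums names in a single step.
first, second = result[key]

print(pair, num1_total, num2_total, first, second)
```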

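If you want to write the result out in the original format, you have to join each key and its two values together with space characters; a verbose, easily commented way of doing that might look like this: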

with open('condensed_log_file.txt', 'w') as out:       # open the output file;

    for date_ip in result:                             # loop through the keys;

        out.write(                                     # write to the logfile:

                  ' '.join(                            # joined by a space char,
                           (date_ip,                   # the key (date_ip);
                            str(result[date_ip][0]),   # the 1st value (num1);
                            str(result[date_ip][1]))   # & the 2nd value (num2);
                          ) + '\n'                     # then end the line.
                 )

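I was curious how piokuc's very neat and clean approach compares in performance with my own naive one. This is without the print and outfile-writing statements: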

>>> from timeit import timeit
>>> a = open("airthomas.py", "r")
>>> a = a.read()
>>> p = open("piokuc.py", "r")
>>> p = p.read()
>>> timeit(p)
115.33428788593137
>>> timeit(a)
103.95908962552267

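So, if you need to run this on a large number of small files, using Counter() might be slightly slower. Of course, you may only need to run it on one or a few very large files, in which case you can test it yourself! ;P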
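If you want to repeat the comparison yourself, the following is a rough sketch in Python 3 syntax, not the exact scripts timed above. The function names with_counter and with_dict are made up for illustration, the input is the six sample lines held in memory instead of a file, and timeit is given a callable with a small number of runs. Absolute timings will differ from machine to machine, so no expected figures are shown.

```python
from collections import Counter
from timeit import timeit

# The sample lines from the question, kept in memory so no file is needed.
LINES = [
    '2013.09:142.134.35.17 10 12',
    '2013.09:142.134.35.17 4 4',
    '2013.09:63.151.172.31 52 13',
    '2013.09:63.151.172.31 10 10',
    '2013.09:63.151.172.31 16 32',
    '2013.10:62.151.172.31 16 32',
]

def with_counter(lines):
    """Two Counters, one per number column (the Counter approach)."""
    sum1, sum2 = Counter(), Counter()
    for line in lines:
        key, n1, n2 = line.split()
        sum1[key] += int(n1)
        sum2[key] += int(n2)
    return sum1, sum2

def with_dict(lines):
    """A plain dict mapping each key to a two-element list (the naive approach)."""
    result = {}
    for line in lines:
        key, n1, n2 = line.split()
        if key in result:
            result[key][0] += int(n1)
            result[key][1] += int(n2)
        else:
            result[key] = [int(n1), int(n2)]
    return result

# number is kept small so the sketch finishes quickly; raise it for steadier figures.
t_counter = timeit(lambda: with_counter(LINES), number=2000)
t_dict = timeit(lambda: with_dict(LINES), number=2000)

print('Counter: %.4f s' % t_counter)
print('dict:    %.4f s' % t_dict)
```

Because the input is tiny, this mostly measures per-call overhead; for the large-file case, build LINES from a real log before timing.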

Answer 5

You can use a dictionary to solve your problem, for example:

#assuming that your addresses are stored in a file:
with open('addresses.txt', 'r') as f:
    lines = f.readlines()
    ele = {}

    for line in lines:
        addr = line.split()
        s = [int(addr[1]), int(addr[2])]
        if addr[0] in ele:
            ele[addr[0]][0] += s[0]
            ele[addr[0]][1] += s[1]
        else:
            ele[addr[0]] = s

This will give you:

{'2013.09:142.134.35.17': [14, 16],
 '2013.09:63.151.172.31': [78, 55],
 '2013.10:62.151.172.31': [16, 32]}
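To get from that dictionary back to the line format asked for in the question, one possible sketch (Python 3 syntax) is shown below. The ele dict is typed in by hand from the output above, and the keys are sorted as plain strings, which is an assumption about the desired order that happens to match the question's example.

```python
# The dict shown above, typed in by hand so the sketch runs on its own.
ele = {
    '2013.09:142.134.35.17': [14, 16],
    '2013.09:63.151.172.31': [78, 55],
    '2013.10:62.151.172.31': [16, 32],
}

# One "date:ip num1 num2" string per key, in sorted key order.
lines = ['{} {} {}'.format(key, sums[0], sums[1])
         for key, sums in sorted(ele.items())]

for line in lines:
    print(line)
```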

Sum of array elements in Python
