Python + CSV: Summarize similar values from CSV columns

Time: 2016-12-28 13:47:19

Tags: python csv

Input file:

$ cat dummy.csv 
OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
Windows,0,0,1,1,0
Mac,1,0,1,1,1
Ubuntu,0,1,0,1,1
Ubuntu,0,0,1,1,1
Ubuntu,1,0,1,0,0
Ubuntu,1,1,1,1,0
Mac,0,0,1,1,0
Mac,1,0,1,1,1
Windows,1,1,1,1,0
Ubuntu,0,0,1,1,0
Windows,1,0,1,1,1
Mac,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Windows,1,1,1,1,0
Mac,0,0,1,1,0

Expected output:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

I generated the output above with an Excel pivot table.

My code:

import csv
import pprint
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] += row['A']
        d[row['OS']]['B'] += row['B']
        d[row['OS']]['C'] += row['C']
        d[row['OS']]['D'] += row['D']
        d[row['OS']]['E'] += row['E']

pprint.pprint(d)

Error:

$ python3 dummy.py
Traceback (most recent call last):
  File "dummy.py", line 10, in <module>
    d[row['OS']]['A'] += row['A']
KeyError: 'A'

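My idea is to accumulate the CSV values in a dictionary and then print it. However, I get the error above when I try to add the values.

This seems like it should be achievable with the built-in csv module. I thought this would be an easy one :( Any pointers would be a great help.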

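7 answers:

Answer 0 (score: 1)

There are two problems. The nested dictionaries initially have no keys set, so d[row['OS']]['A'] raises an error. The other problem is that you need to convert the column values to int before adding them.

You can use Counter as the value type of the defaultdict, since a missing key in a Counter defaults to 0: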

import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)

    for row in reader:
        nested = d[row.pop('OS')]
        for k, v in row.items():
            nested[k] += int(v)

print(*d.items(), sep='\n')

Output:

('Ubuntu', Counter({'D': 6, 'C': 5, 'B': 4, 'E': 3, 'A': 3}))
('Windows', Counter({'C': 6, 'D': 6, 'E': 3, 'A': 3, 'B': 2}))
('Mac', Counter({'C': 6, 'D': 5, 'A': 4, 'E': 3, 'B': 1}))
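
The two problems named in this answer can be reproduced in isolation. The following is only a minimal sketch: the sample row and the variable names (plain, counts, and so on) are made up for illustration and do not come from the original post.

```python
from collections import Counter, defaultdict

# A hypothetical row, shaped like what csv.DictReader yields: every value is a str.
row = {'OS': 'Ubuntu', 'A': '0', 'B': '1'}

# Problem 1: defaultdict(dict) creates the inner dict on demand, but that
# inner dict is empty, so reading the old value for "+=" raises KeyError.
plain = defaultdict(dict)
try:
    plain[row['OS']]['A'] += int(row['A'])
    key_error_raised = False
except KeyError:
    key_error_raised = True

# Problem 2: without int(), "+" concatenates strings instead of adding numbers.
concatenated = row['A'] + row['B']       # '01'
added = int(row['A']) + int(row['B'])    # 1

# Counter returns 0 for a missing key, so "+=" works on first use.
counts = defaultdict(Counter)
counts[row['OS']]['A'] += int(row['A'])
counts[row['OS']]['B'] += int(row['B'])

print(key_error_raised, repr(concatenated), added, dict(counts['Ubuntu']))
```

Note that even with the KeyError fixed, skipping int() would fail differently here: Counter's default of 0 cannot be added to a str.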

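Answer 1 (score: 1)

This does not exactly answer your question, since the problem can indeed be solved with csv, but it is worth mentioning that pandas is well suited to this kind of task: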

In [1]: import pandas as pd

In [2]: df = pd.read_csv('dummy.csv')

In [3]: df.groupby('OS').sum()
Out[3]:
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

Answer 2 (score: 1)

Something like this? You can write the DataFrame to a CSV file to get the desired format.

import pandas as pd

# Alternatively, copy the data and use: df0 = pd.read_clipboard(sep=',')
df0 = pd.read_csv('dummy.csv')
df = df0.groupby(by='OS').sum()
print(df)

Output:

         A  B  C  D  E
OS                    
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

Writing the result to a file:

df.to_csv('file01')

file01:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

Answer 3 (score: 1)

You get this exception because the first time a given row['OS'] is seen, d has no entry for it, so d[row['OS']] is a new empty dict and 'A' does not exist in it. Try the following to fix this:

import csv
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] = d[row['OS']]['A'] + int(row['A']) if (row['OS'] in d and 'A' in d[row['OS']]) else int(row['A'])
        d[row['OS']]['B'] = d[row['OS']]['B'] + int(row['B']) if (row['OS'] in d and 'B' in d[row['OS']]) else int(row['B'])
        d[row['OS']]['C'] = d[row['OS']]['C'] + int(row['C']) if (row['OS'] in d and 'C' in d[row['OS']]) else int(row['C'])
        d[row['OS']]['D'] = d[row['OS']]['D'] + int(row['D']) if (row['OS'] in d and 'D' in d[row['OS']]) else int(row['D'])
        d[row['OS']]['E'] = d[row['OS']]['E'] + int(row['E']) if (row['OS'] in d and 'E' in d[row['OS']]) else int(row['E'])

Output:

>>> import pprint
>>>
>>> pprint.pprint(dict(d))
{'Mac': {'A': 4, 'B': 1, 'C': 6, 'D': 5, 'E': 3},
 'Ubuntu': {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 3},
 'Windows': {'A': 3, 'B': 2, 'C': 6, 'D': 6, 'E': 3}}
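
The repeated membership checks can be avoided if the inner dictionary also supplies a default. The following is a rough sketch of that idea, not the answer author's code: it assumes a nested defaultdict(int), uses a hypothetical helper named sum_by_os, and reads a few rows of the question's data from memory instead of from dummy.csv.

```python
import csv
import io
from collections import defaultdict

# A few rows taken from the question's dummy.csv, held in memory so the
# sketch runs without a file on disk.
SAMPLE = """OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
"""

def sum_by_os(csvfile):
    """Sum every non-OS column per OS value (hypothetical helper)."""
    # Outer default: a new inner mapping; inner default: 0 for any column.
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(csvfile):
        os_name = row.pop('OS')
        for column, value in row.items():
            totals[os_name][column] += int(value)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {os_name: dict(cols) for os_name, cols in totals.items()}

result = sum_by_os(io.StringIO(SAMPLE))
print(result['Ubuntu'])
```

Looping over row.items() also removes the need to list columns A through E by hand, so the same function would work if the file gained more columns.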

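Answer 4 (score: 0)

d is a dictionary, so d[row['OS']] is a valid expression, but d[row['OS']]['A'] += ... expects that nested item to already hold a value under 'A'. Since the default factory is dict, the nested item starts out as an empty dict with no 'A' key, so the lookup fails.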

Answer 5 (score: 0)

This extends niemmi's solution to format the output the same way as the OP's example:

import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)
with open('dummy.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    field_names = reader.fieldnames
    for row in reader:
        counter = d[row.pop('OS')]
        for key, value in row.items():
            counter[key] += int(value)

# Header first, then one line per OS with the sums in column-name order
print(','.join(field_names))
for os_name, counter in sorted(d.items()):
    print("%s,%s" % (os_name, ','.join(str(v) for k, v in sorted(counter.items()))))

Output:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

Update: fixed the output.
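
As an alternative to joining strings by hand, the csv module can also write the result. The following is a sketch only: the totals dictionary is hard-coded from the expected output in the question, write_totals is a hypothetical helper name, and lineterminator='\n' is chosen here just to keep the printed output free of '\r'.

```python
import csv
import io
from collections import Counter

# Hypothetical, hard-coded totals matching the expected output in the question.
totals = {
    'Windows': Counter(A=3, B=2, C=6, D=6, E=3),
    'Mac': Counter(A=4, B=1, C=6, D=5, E=3),
    'Ubuntu': Counter(A=3, B=4, C=5, D=6, E=3),
}
field_names = ['OS', 'A', 'B', 'C', 'D', 'E']

def write_totals(totals, field_names, out):
    """Write one CSV row per OS, sorted by OS name (hypothetical helper)."""
    writer = csv.DictWriter(out, fieldnames=field_names, lineterminator='\n')
    writer.writeheader()
    for os_name in sorted(totals):
        # Merge the OS name with that OS's per-column sums into one row dict.
        writer.writerow({'OS': os_name, **totals[os_name]})

buffer = io.StringIO()
write_totals(totals, field_names, buffer)
print(buffer.getvalue(), end='')
```

Because DictWriter places values according to fieldnames, the column order follows the header rather than the sort order of the Counter keys.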

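Answer 6 (score: 0)

I assume your input file is named input_file.csv.

You can also use groupby from the itertools module together with two dicts to process your data and get the desired output, as in the following example: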

from itertools import groupby

data = list(k.strip("\n").split(",") for k in open("input_file.csv", 'r'))

a, b = {}, {}
for k, v in groupby(data[1:], lambda x : x[0]):
    try:
        a[k] += [i[1:] for i in list(v)]
    except KeyError:
        a[k] = [i[1:] for i in list(v)]

for key in a.keys():
    for j in range(5):
        c = 0
        for i in a[key]:
            c += int(i[j])
        try:
            b[key] += ',' + str(c) 
        except KeyError:
            b[key] = str(c)

Output:

print(','.join(data[0]))
for k in b.keys():
    print("{0},{1}".format(k, b[k]))

>>> OS,A,B,C,D,E
>>> Ubuntu,3,4,5,6,3
>>> Windows,3,2,6,6,3
>>> Mac,4,1,6,5,3
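
The try/except blocks in this answer are needed because itertools.groupby only groups consecutive rows with the same key, and the input file is not sorted by OS. The following sketch illustrates that behaviour and one possible way around it (sorting first); the three sample rows are made up, and this is not the answer author's code.

```python
from itertools import groupby

# Hypothetical rows: OS name followed by two value columns, not sorted by OS.
rows = [
    ['Ubuntu', '0', '1'],
    ['Mac', '1', '0'],
    ['Ubuntu', '1', '1'],
]

# Unsorted: groupby starts a new group whenever the key changes,
# so 'Ubuntu' shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, key=lambda r: r[0])]

# Sorted by the same key first: each OS forms exactly one group.
sums = {}
for key, group in groupby(sorted(rows, key=lambda r: r[0]), key=lambda r: r[0]):
    # zip(*...) transposes the group's value columns so each can be summed.
    columns = zip(*(r[1:] for r in group))
    sums[key] = [sum(int(v) for v in col) for col in columns]

print(unsorted_keys)
print(sums)
```

Sorting costs an extra pass over the data, but it lets a single dictionary replace the two-step accumulation and the hard-coded column count.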