Grouping and sub-grouping CSV data with Python

Time: 2014-04-10 22:11:03

Tags: python csv grouping

Here is my sample dataset in CSV format:

Column[1], Column[2], Account, CostCentre, Rate, Ex VAT,  VAT
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1

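I am trying to create an output file that focuses on the account number and further groups the data by cost centre and tax rate. Any row with account number 4200213 needs to be included in the output; all other rows can be ignored.

Secondly, if a cost centre repeats, say G1023 in this instance, I want the Python script to determine whether the tax rates match. If they do, I want the output file to group by Rate and sum the Ex VAT, VAT, and total In VAT amounts, so that the expected result looks like this: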

Cost Centre, Rate, Ex VAT, VAT, In VAT
      G1023,    0,     21,   0,    21
      G1023,   20,     21, 4.2,    25.20

I have been trying to figure this out without any success. My current code is as follows:

import os
import sys
import csv

os.path.dirname = "./"
InputFile_name = "Book1.csv"
InputFile = csv.reader(open(InputFile_name, "r"))
OutputFile_name = "Journal.csv"
OutputFile = open(OutputFile_name, "w")
mydict = []

OutputFile.write("Cost Centre, Tax Rate, Total Ex VAT, VAT, Total In VAT\n")

for line in InputFile:
    if line[2] == "4200213":
        Cost_Centre = line[3]
        Rate = line[4]
        Ex_VAT = line[5]
        VAT = line[6]
        if Cost_Centre in mydict:
            continue
        else:
            mydict.append(Cost_Centre)

        for item in mydict:
            if item in Cost_Centre and Rate == "0":
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            elif item in Cost_Centre and Rate == "20":
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT+"\n"]))
OutputFile.close()
print "Finished."
sys.exit()

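The script runs, but I am nowhere near the expected result. As you have probably noticed, I am not very good at Python, so I would be grateful if you could modify the script and give me the complete script rather than just pointing out what I did wrong.

3 Answers:

Answer 0: (score: 1)

You can use itertools.groupby. I wrote this; unfortunately it is not easy to read.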

import csv
import itertools

csvreader = csv.reader(open("Book1.csv", "r"))
lines = [line for line in csvreader]

#Sort
lines =  sorted(lines[1:], key = lambda x: (x[4], x[3], x[2]))

#Grouping
newRows = []
for grp in itertools.groupby(lines, key = lambda x: (x[2], x[3], x[4])):
    newRow = [0, 0] + list(grp[0]) + [0.0, 0.0, 0.0]
    for col in grp[1]:
        newRow[5] += float(col[5])
        newRow[6] += float(col[6])
        newRow[7] += float(col[5]) + float(col[6])
    newRows.append(newRow)

#Filtering and write csv
with open("Journal.csv", "w") as fp:
    csvwriter = csv.writer(fp)
    csvwriter.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])
    for r in filter(lambda x:x[2].strip() == "4200213", newRows):
        csvwriter.writerow(r[3:])

I hope it helps.
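For readers on Python 3, the same groupby idea can be sketched a little more readably. This is only an illustration, not the answerer's code: the function name `summarise`, the inline `SAMPLE` string (copied from the question), and the choice to round totals to two decimals are all assumptions made for the example. The key point is that `itertools.groupby` only merges adjacent rows, so the rows are sorted with the same key that is used for grouping.

```python
import csv
import io
from itertools import groupby

# Sample rows copied from the question (inlined so the sketch is self-contained).
SAMPLE = """\
Column[1], Column[2], Account, CostCentre, Rate, Ex VAT,  VAT
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1
"""


def summarise(csv_text, account="4200213"):
    """Hypothetical helper: total Ex VAT and VAT per (cost centre, rate)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    rows = [[cell.strip() for cell in row] for row in reader if row]
    rows = [row for row in rows if row[2] == account]  # keep one account only

    def key(row):
        return (row[3], row[4])  # (cost centre, rate), both kept as strings

    summary = []
    # groupby merges only adjacent rows, so sort by the same key first.
    for (centre, rate), group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        ex_vat = sum(float(r[5]) for r in group)
        vat = sum(float(r[6]) for r in group)
        # Rounding to 2 decimals is an assumption suited to currency amounts.
        summary.append([centre, rate, round(ex_vat, 2), round(vat, 2),
                        round(ex_vat + vat, 2)])
    return summary


result = summarise(SAMPLE)
for row in result:
    print(row)
```

Each entry of `result` can be passed to `csv.writer(...).writerow` to produce the journal file; in Python 3 the output file should be opened with `newline=""`.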

Answer 1: (score: 1)

Life is too short. This is exactly the kind of thing that libraries like pandas excel at. The whole code:

import pandas as pd
df = pd.read_csv("tax.csv", skipinitialspace=True)
d2 = df.groupby(["CostCentre", "Rate"])[["Ex VAT", "VAT"]].sum()
d2["IN VAT"] = d2["Ex VAT"] + d2["VAT"]
d2.reset_index().to_csv("taxout.csv", index=False)

This produces a new CSV file that looks like this:

CostCentre,Rate,Ex VAT,VAT,IN VAT
G1023,0,21.0,0.0,21.0
G1023,20,21.0,4.2,25.2

Answer 2: (score: 0)

I added some comments to your code:

import os
import sys # not necessary (see comment below)
import csv

os.path.dirname = "./" # not necessary (current directory is always $PWD)

# I would do:
InputFile = csv.reader(open("Book1.csv", "r"))
OutputFile = open("Journal.csv", "w")
mydict = [] # Okay, but you could also use set() (that's the structure you want in the end)
            # the name "mydict" is confusing (it's a list)

OutputFile.write("Cost Centre, Tax Rate, Total Ex VAT, VAT, Total In VAT\n")

for line in InputFile:
    if line[2] == "4200213":
        Cost_Centre = line[3]
        Rate = line[4]
        Ex_VAT = line[5] # you mean float(line[5])
        VAT = line[6]    # you mean float(line[6])
        if Cost_Centre in mydict:
            continue
        else:
            mydict.append(Cost_Centre)

        for item in mydict:
            # Why do you have an if/elif statement here? Both branches do exactly the same thing!
            # Why not delete this if/elif statement?
            if item in Cost_Centre and Rate == "0": # I guess you mean: item == Cost_Centre
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            elif item in Cost_Centre and Rate == "20": # I guess you mean: item == Cost_Centre
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT

            # I would write
            # OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT]) +"\n")
            OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT+"\n"]))

OutputFile.close()

print "Finished."

sys.exit() # not necessary

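Lowercase names are the usual convention in Python (see http://legacy.python.org/dev/peps/pep-0008/ and "What is the naming convention in Python for variable and function names?").

As for your question: you have to read all the rows first, and only then compute the totals and write the final CSV. The mistake is (for example):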

if line[2] == "4200213":
    ...
    Ex_VAT = float(line[5]) # new variable is read

    ...

         Ex_VAT += Ex_VAT # this always yields Ex_VAT * 2, never a running total
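The doubling effect is easy to reproduce in isolation. The following is a minimal sketch with made-up values (the list `values` and the names `doubled` and `total` are only for illustration); it contrasts the pattern from the question with a separate accumulator.

```python
values = [10.50, 10.50, 10.50]  # hypothetical Ex VAT amounts from three rows

# Pattern from the question: the variable is overwritten by every new row,
# so "+=" only adds the value to itself.
for value in values:
    ex_vat = value    # the freshly read value replaces whatever was there
    ex_vat += ex_vat  # value * 2, not a running sum
doubled = ex_vat

# Fixed pattern: keep the accumulator separate from the value being read.
total = 0.0
for value in values:
    total += value

print(doubled)  # 21.0 (just the last value doubled)
print(total)    # 31.5 (the sum of all three rows)
```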

Update: here is my code:

import csv

from collections import defaultdict
from operator    import add

class vector(tuple):
    def __add__(self, other):
        return vector(other) if len(self) == 0 else vector(map(add, self, other))

mydict = defaultdict(vector)

with open("data.csv", "r") as fd:
    for line in csv.reader(fd):
        line = map(str.strip, line)

        if line[2] == "4200213":
            mydict[line[3], line[4]] += float(line[5]), float(line[6])

with open("journal.csv", "w") as fd:
    writer = csv.writer(fd)
    writer.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])

    for k,v in mydict.iteritems():
        print repr(v)
        writer.writerow(list(k) + list(v) + [sum(v)])

The same code with comments:

import csv

from collections import defaultdict # see https://docs.python.org/2/library/collections.html#collections.defaultdict
from operator    import add # add(x,y) == x + y

# for having vector( (1,2,3) ) + vector( (4,5,6) ) = vector( (5,7,9) )
# see https://stackoverflow.com/questions/2576296/using-python-tuples-as-vectors
# and lookup operator overloading for python on the internet
class vector(tuple): 
    def __add__(self, other):
        return vector(other) if len(self) == 0 else vector(map(add, self, other))

# will be in the end
# mydict = {
#     ("G1023","20"): vector((21.0,4.2)),
#     ("G1023","0"): vector((21.0,0.0))
# }
mydict = defaultdict(vector)

# === read csv file ===

with open("data.csv", "r") as fd: # we do not have to call fd.close() at the end -> very handy ;-) + exception safe!
    for line in csv.reader(fd):
        line = map(str.strip, line) # delete whitespaces from all cells

        if line[2] == "4200213":
            mydict[line[3], line[4]] += float(line[5]), float(line[6])

# === write final csv file ===

with open("journal.csv", "w") as fd:
    writer = csv.writer(fd)
    writer.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])

    for k,v in mydict.iteritems():
        writer.writerow(list(k) + list(v) + [sum(v)]) # output each line in the csv

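I suggest you read the code above slowly, line by line, until you understand how everything works (I used some neat Python features). Look up anything you do not know on the Internet. If you have any questions, feel free to ask me in the comments or post a follow-up on Stack Overflow.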
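The code above targets Python 2: it uses the `print` statement and `dict.iteritems()`, and it indexes the result of `map`, which is an iterator in Python 3. A rough Python 3 sketch of the same defaultdict idea follows. It is not the answerer's code: the function name `build_journal`, the inline `SAMPLE` data, the plain two-element list used in place of the `vector` class, and the rounding to two decimals are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question (inlined so the sketch is self-contained).
SAMPLE = """\
Column[1], Column[2], Account, CostCentre, Rate, Ex VAT,  VAT
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1
000000000, 00000000,  4200213,    G1023,       0, 10.50,  0.0
000000000, 00000000,  4200213,    G1023,      20, 10.50,  2.1
"""


def build_journal(csv_text, account="4200213"):
    """Hypothetical helper: return the journal CSV as a string."""
    # (cost centre, rate) -> [total Ex VAT, total VAT]
    totals = defaultdict(lambda: [0.0, 0.0])

    # First pass: read every row and accumulate the totals.
    for row in csv.reader(io.StringIO(csv_text)):
        row = [cell.strip() for cell in row]  # remove padding around cells
        if len(row) >= 7 and row[2] == account:  # the header row never matches
            totals[row[3], row[4]][0] += float(row[5])
            totals[row[3], row[4]][1] += float(row[6])

    # Second pass: write one line per (cost centre, rate) group.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])
    for (centre, rate), (ex_vat, vat) in sorted(totals.items()):
        # Rounding to 2 decimals is an assumption suited to currency amounts.
        writer.writerow([centre, rate, round(ex_vat, 2), round(vat, 2),
                         round(ex_vat + vat, 2)])
    return out.getvalue()


journal = build_journal(SAMPLE)
print(journal)
```

To work with real files instead of strings, the two `io.StringIO` objects can be swapped for `open("data.csv", newline="")` and `open("journal.csv", "w", newline="")`.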