Parsing a CSV file to aggregate data in Python

Asked: 2015-11-10 22:01:40

Tags: python-2.7 group-by itertools

I have a pipe-delimited CSV file of about 3,000 rows with the following column headers:

STORE|BRAND ID|ZONE|SHORT DESCRIPTION (ZONE)|REGION|SHORT DESCRIPTION (REGION)|DISTRICT|SHORT DESCRIPTION (DISTRICT)

Here is a sample of the data:

0010|Company A|0001|East|123|New England|012|Connecticut
0010|Company B|0002|West|456|Coast|010|Oregon
0025|Company A|0001|East|246|South|010|Florida
0010|Company A|0004|West|456|Coast|011|California
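
For reference, a minimal sketch of reading rows in this layout with the standard csv module. It targets Python 3, which is an assumption since the question is tagged python-2.7, and `SAMPLE` and `read_rows` are hypothetical names standing in for the real file and reader code.

```python
import csv
import io

# Hypothetical in-memory stand-in for the file: the header plus two sample rows.
SAMPLE = (
    "STORE|BRAND ID|ZONE|SHORT DESCRIPTION (ZONE)|REGION|"
    "SHORT DESCRIPTION (REGION)|DISTRICT|SHORT DESCRIPTION (DISTRICT)\n"
    "0010|Company A|0001|East|123|New England|012|Connecticut\n"
    "0010|Company B|0002|West|456|Coast|010|Oregon\n"
)


def read_rows(text):
    """Parse pipe-delimited text into a list of dicts keyed by column header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_rows(SAMPLE)
print(rows[0]["ZONE"], rows[0]["SHORT DESCRIPTION (ZONE)"])
```

Every value stays a string, so IDs such as 0010 keep their leading zeros.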

The file contains repeated zones, regions, and districts. What I want to accomplish is to aggregate the data by ZONE, REGION, and DISTRICT for each BRAND:

:ZONES:

0001|Company A|East
0002|Company B|West
0004|Company A|West

:REGIONS:

123|Company A|New England
456|Company B|Coast
246|Company A|South

:DISTRICTS:

012|Company A|Connecticut
010|Company B|Oregon
010|Company A|Florida
011|Company A|California
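
To pin down what each listing means, here is a rough sketch for the zone level only: every distinct (ID, brand, description) combination becomes one output line. It assumes Python 3 and rows shaped like csv.DictReader output; `distinct_lines` is a hypothetical helper, not part of any library.

```python
# Hypothetical rows, shaped like csv.DictReader output for the sample data
# (only the columns needed for the zone listing are included).
rows = [
    {"BRAND ID": "Company A", "ZONE": "0001", "SHORT DESCRIPTION (ZONE)": "East"},
    {"BRAND ID": "Company B", "ZONE": "0002", "SHORT DESCRIPTION (ZONE)": "West"},
    {"BRAND ID": "Company A", "ZONE": "0001", "SHORT DESCRIPTION (ZONE)": "East"},
    {"BRAND ID": "Company A", "ZONE": "0004", "SHORT DESCRIPTION (ZONE)": "West"},
]


def distinct_lines(rows, id_col, desc_col, brand_col="BRAND ID"):
    """Return sorted, de-duplicated 'ID|BRAND|DESCRIPTION' lines."""
    # A set drops repeated combinations without needing a prior sort.
    combos = {(row[id_col], row[brand_col], row[desc_col]) for row in rows}
    return ["|".join(combo) for combo in sorted(combos)]


zone_lines = distinct_lines(rows, "ZONE", "SHORT DESCRIPTION (ZONE)")
print("\n".join(zone_lines))
```

The same helper would apply to the region and district columns by passing their column names instead.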

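The code below, which uses itertools / groupby, works fine and gives me the data I need. What bothers me is that I read the file and find the distinct zones, read it again to find the distinct regions, and then read it a third time to find the distinct districts. I think there must be a more streamlined way to read this file and aggregate the data.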

import csv
from itertools import groupby
from operator import itemgetter

# myFile is assumed to hold the path to the pipe-delimited file.
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)

with open(myFile) as csvfile:
    # Pass 1: read every row and sort by the zone key.
    reader = csv.DictReader(csvfile, dialect='piper')
    zones_dict = sorted(list(reader), key=itemgetter('ZONE', 'SHORT DESCRIPTION (ZONE)', 'BRAND ID'))

    # Pass 2: rewind, read every row again, and sort by the region key.
    csvfile.seek(0)
    reader = csv.DictReader(csvfile, dialect='piper')
    regions_dict = sorted(list(reader), key=itemgetter('REGION', 'SHORT DESCRIPTION (REGION)', 'BRAND ID', 'ZONE'))

    # Pass 3: rewind, read every row again, and sort by the district key.
    csvfile.seek(0)
    reader = csv.DictReader(csvfile, dialect='piper')
    districts_dict = sorted(list(reader), key=itemgetter('DISTRICT', 'SHORT DESCRIPTION (DISTRICT)', 'BRAND ID', 'REGION'))

    ###
    #Aggregate Zone#
    ###

    # groupby yields one key tuple per distinct combination; element 0 is the ID.
    for zone_id, zone_group in groupby(zones_dict, itemgetter('ZONE', 'SHORT DESCRIPTION (ZONE)', 'BRAND ID')):
        theZone = zone_id
        print(theZone[0])

    print(" ")

    ###
    #Aggregate Region#
    ###

    for region_id, region_group in groupby(regions_dict, itemgetter('REGION', 'SHORT DESCRIPTION (REGION)', 'BRAND ID', 'ZONE')):
        theRegion = region_id
        print(theRegion[0])

    print(" ")

    ###
    #Aggregate District#
    ###

    for district_id, district_group in groupby(districts_dict, itemgetter('DISTRICT', 'SHORT DESCRIPTION (DISTRICT)', 'BRAND ID', 'REGION')):
        theDistrict = district_id
        print(theDistrict[0])

Any ideas on a better approach?
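
One possible direction, sketched under assumptions rather than offered as a definitive fix: read the file once into a list, then sort and group that in-memory list once per level. The sketch assumes Python 3 and a file small enough to hold in memory (about 3,000 rows should be). `LEVELS`, `aggregate`, and `SAMPLE` are hypothetical names, and the `io.StringIO` object stands in for `open(myFile)`.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# (ID column, description column) per level; column names come from the header.
LEVELS = {
    "ZONES": ("ZONE", "SHORT DESCRIPTION (ZONE)"),
    "REGIONS": ("REGION", "SHORT DESCRIPTION (REGION)"),
    "DISTRICTS": ("DISTRICT", "SHORT DESCRIPTION (DISTRICT)"),
}

# Hypothetical stand-in for the real file: the header plus the sample rows.
SAMPLE = (
    "STORE|BRAND ID|ZONE|SHORT DESCRIPTION (ZONE)|REGION|"
    "SHORT DESCRIPTION (REGION)|DISTRICT|SHORT DESCRIPTION (DISTRICT)\n"
    "0010|Company A|0001|East|123|New England|012|Connecticut\n"
    "0010|Company B|0002|West|456|Coast|010|Oregon\n"
    "0025|Company A|0001|East|246|South|010|Florida\n"
    "0010|Company A|0004|West|456|Coast|011|California\n"
)


def aggregate(csvfile):
    """Read the file once, then group the in-memory rows once per level."""
    rows = list(csv.DictReader(csvfile, delimiter="|", quoting=csv.QUOTE_NONE))
    result = {}
    for label, (id_col, desc_col) in LEVELS.items():
        key = itemgetter(id_col, "BRAND ID", desc_col)
        # groupby only merges adjacent rows, so sort by the same key first.
        result[label] = [combo for combo, _ in groupby(sorted(rows, key=key), key)]
    return result


result = aggregate(io.StringIO(SAMPLE))
for label, combos in result.items():
    print(":%s:" % label)
    for combo in combos:
        print("|".join(combo))
    print("")
```

Each level still needs its own sort, but the file is parsed only once. Note that with the sample rows this lists region 456 (Coast) under both Company A and Company B, because each distinct ID, brand, and description combination is kept.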

0 answers:

No answers yet.