Analyzing a CSV file with Python

Time: 2017-04-01 01:19:12

Tags: python csv

I am trying to find the three most populous parts of the city (bydele) in 1992.

I have a CSV file, available here: http://data.kk.dk/dataset/9070067f-ab57-41cd-913e-bc37bfaf9acd/resource/9fbab4aa-1ee0-4d25-b2b4-b7b63537d2ec/download/befkbhalderkoencivst.csv

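The columns of the CSV file can be read as follows:

AAR: the year of the observation

BYDEL: the part of the city, given as an integer according to the mapping below; 1 = Indre By, 2 = Østerbro, 3 = Nørrebro, 4 = Vesterbro/Kgs. Enghave, 5 = Valby, 6 = Vanløse, 7 = Brønshøj-Husum, 8 = Bispebjerg, 9 = Amager Øst, 10 = Amager Vest, 99 = Udenfor inddeling

ALDER: the age of the persons observed

PERSONER: the number of persons observed with the features given in the row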

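I have a solution, but it is very repetitive. I think it could be done more cleverly, but I do not have enough Python experience. Could someone point me in the right direction?

My code/solution looks like this: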

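import collections
import pandas as pd

df = pd.read_csv('befkbh.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = df.values  # as_matrix() was removed in pandas 1.0; .values is the equivalent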
Q31 = collections.defaultdict(list)
Q32 = collections.defaultdict(list)
Q33 = collections.defaultdict(list)
Q34 = collections.defaultdict(list)
Q35 = collections.defaultdict(list)
Q36 = collections.defaultdict(list)
Q37 = collections.defaultdict(list)
Q38 = collections.defaultdict(list)
Q39 = collections.defaultdict(list)
Q310 = collections.defaultdict(list)
Q399 = collections.defaultdict(list)
for row in data:
    key = row[0]
    if key == "" or key == 0: continue
    if key == 1992:
        if row[2] == 1:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q31.setdefault(key,[]).append(val)
        if row[2] == 2:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q32.setdefault(key,[]).append(val)
        if row[2] == 3:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q33.setdefault(key,[]).append(val)
        if row[2] == 4:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q34.setdefault(key,[]).append(val)
        if row[2] == 5:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q35.setdefault(key,[]).append(val)
        if row[2] == 6:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q36.setdefault(key,[]).append(val)
        if row[2] == 7:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q37.setdefault(key,[]).append(val)
        if row[2] == 8:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q38.setdefault(key,[]).append(val)
        if row[2] == 9:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q39.setdefault(key,[]).append(val)
        if row[2] == 10:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q310.setdefault(key,[]).append(val)
        if row[2] == 99:
            val = 0 if(row[5]) ==""  else float(row[5])
            Q399.setdefault(key,[]).append(val)

Q312 = {}
for k, v in Q31.items(): Q312[k] = sum(v)
for k, v in Q312.items(): print ("{}:{}".format(k,v))
Q322 = {}
for k, v in Q32.items(): Q322[k] = sum(v)
for k, v in Q322.items(): print ("{}:{}".format(k,v))
Q332 = {}
for k, v in Q33.items(): Q332[k] = sum(v)
for k, v in Q332.items(): print ("{}:{}".format(k,v))
Q342 = {}
for k, v in Q34.items(): Q342[k] = sum(v)
for k, v in Q342.items(): print ("{}:{}".format(k,v))
Q352 = {}
for k, v in Q35.items(): Q352[k] = sum(v)
for k, v in Q352.items(): print ("{}:{}".format(k,v))
Q362 = {}
for k, v in Q36.items(): Q362[k] = sum(v)
for k, v in Q362.items(): print ("{}:{}".format(k,v))
Q372 = {}
for k, v in Q37.items(): Q372[k] = sum(v)
for k, v in Q372.items(): print ("{}:{}".format(k,v))
Q382 = {}
for k, v in Q38.items(): Q382[k] = sum(v)
for k, v in Q382.items(): print ("{}:{}".format(k,v))
Q392 = {}
for k, v in Q39.items(): Q392[k] = sum(v)
for k, v in Q392.items(): print ("{}:{}".format(k,v))
Q3102 = {}
for k, v in Q310.items(): Q3102[k] = sum(v)
for k, v in Q3102.items(): print ("{}:{}".format(k,v))
Q3992 = {}
for k, v in Q399.items(): Q3992[k] = sum(v)
for k, v in Q3992.items(): print ("{}:{}".format(k,v))

2 Answers:

Answer 0 (score: 5)

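It is actually a very good sign that you recognized there must be a simpler way! Whenever you find yourself violating the DRY principle (Don't Repeat Yourself), you should ask whether you have taken a wrong turn.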

While you could remove a lot of the duplication simply by using a dictionary of dictionaries instead of all those named variables, since you are using pandas I would take advantage of groupby and nlargest instead.

First, we group on the AAR and BYDEL columns and, within each group, take the PERSONER values and sum them. This gives us a frame to start from.

Then we select the rows with AAR == 1992, and from those the rows with the 3 largest PERSONER values.
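The two steps described above might look like the following sketch. It uses a small made-up frame in place of the real file: only the column names AAR, BYDEL, and PERSONER come from the question, and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from pd.read_csv(...).
df = pd.DataFrame({
    "AAR":      [1992, 1992, 1992, 1992, 1992, 1993],
    "BYDEL":    [1, 1, 2, 3, 4, 1],
    "PERSONER": [10, 5, 30, 20, 8, 99],
})

# Step 1: sum PERSONER within each (AAR, BYDEL) group.
# The result is a Series with a two-level (AAR, BYDEL) index.
totals = df.groupby(["AAR", "BYDEL"])["PERSONER"].sum()

# Step 2: keep only the 1992 entries, then take the three largest totals.
top3_1992 = totals.loc[1992].nlargest(3)
print(top3_1992)
```

The result is indexed by BYDEL, so the area codes of the three largest districts can be read straight from its index.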

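If you are going to do this kind of data processing, I strongly, strongly recommend reading through the pandas tutorial; otherwise you will find yourself reinventing the wheel.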

Answer 1 (score: 1)

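A more pythonic solution would use a dictionary rather than many (mostly) named variables. You are also using setdefault together with defaultdict instances - either one is a fine choice, but you do not need both.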
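To illustrate the point about setdefault and defaultdict, here is a minimal sketch showing that each one on its own produces the same result. The key 1992 and the values are placeholders chosen for illustration.

```python
import collections

# Option 1: a plain dict, with setdefault supplying the empty list.
plain = {}
plain.setdefault(1992, []).append(3.0)
plain.setdefault(1992, []).append(4.0)

# Option 2: a defaultdict, which creates the empty list on first access,
# so no setdefault call is needed.
auto = collections.defaultdict(list)
auto[1992].append(3.0)
auto[1992].append(4.0)

print(plain)
print(dict(auto))
```

Calling setdefault on a defaultdict still works, but it only repeats what the defaultdict already does by itself.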

My alternative version (not using pandas for the grouping, since @DSM covered that nicely):

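import collections
import pandas as pd

df = pd.read_csv('befkbh.csv',quotechar='"',skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = df.values  # as_matrix() was removed in pandas 1.0; .values is the equivalent
# One defaultdict per area code (1-10 plus 99) instead of eleven named variables
areas = { k : collections.defaultdict(list) for k in range(1,11) }
areas[99] = collections.defaultdict(list)

for row in data:
    key = row[0]
    # row[1] is BYDEL; row[5] is taken to be PERSONER, as in the question
    if key == 1992 and row[1] in areas:
        areas[row[1]][key].append(0 if(row[5]) ==""  else float(row[5]))

# Print the summed 1992 total for each area, in area-code order
for area in sorted(areas):
    for k, v in areas[area].items():
        print ("{}:{}".format(k, sum(v)))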

I am assuming that row[2] in the question should be row[1], since BYDEL is the second column, not the third.
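One way to sidestep this kind of positional mix-up is to look columns up by name rather than by index. Here is a minimal sketch with the standard csv module, using a few invented rows. The column order and values are assumptions for illustration, not taken from the real file, which has more columns.

```python
import csv
import io

# Hypothetical excerpt standing in for the real file.
sample = io.StringIO(
    "AAR,BYDEL,ALDER,PERSONER\n"
    "1992,1,0,12\n"
    "1992,1,1,8\n"
    "1992,2,0,30\n"
    "1993,2,0,99\n"
)

# DictReader maps each row to {column name: value}, so the code does not
# depend on where a column sits in the file.
totals = {}
for row in csv.DictReader(sample):
    if int(row["AAR"]) == 1992:
        bydel = int(row["BYDEL"])
        totals[bydel] = totals.get(bydel, 0) + float(row["PERSONER"])

print(totals)
```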

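To get the top three for each year, I organized things slightly differently: the outer dictionary is keyed by year rather than by area.

That version looks like this: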

years = collections.defaultdict(lambda : collections.defaultdict(list))

for row in data:
    years[row[0]][row[1]].append(0 if(row[5]) ==""  else float(row[5]))

for year in sorted(years):
    for n, area in sorted((sum(v), k) for k, v in years[year].items())[:-4:-1]:
        print ("{} {:4} {:9}".format(year, area, n))
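The slice [:-4:-1] takes the last three items of the sorted list in reverse order, that is, the three largest sums. The same selection can also be written with heapq.nlargest from the standard library. This is a sketch on invented rows: the (year, area, persons) tuples are made up for illustration and stand in for the real data.

```python
import collections
import heapq

# Hypothetical (year, area, persons) rows standing in for the real data.
rows = [
    (1992, 1, 10.0), (1992, 2, 30.0), (1992, 3, 20.0), (1992, 4, 8.0),
    (1993, 1, 5.0), (1993, 2, 7.0),
]

# Outer key: year; inner key: area; value: running total of persons.
years = collections.defaultdict(lambda: collections.defaultdict(float))
for year, area, persons in rows:
    years[year][area] += persons

# For each year, pick the (area, total) pairs with the three largest totals.
top3 = {}
for year in sorted(years):
    top3[year] = heapq.nlargest(3, years[year].items(), key=lambda kv: kv[1])
    for area, n in top3[year]:
        print("{} {:4} {:9}".format(year, area, n))
```

Keeping a running total per area, instead of a list that is summed later, is a design choice of this sketch rather than part of the answer's version.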