Here is my data:
Year Country Albania Andorra Armenia Austria Azerbaijan
2009 Lithuania 0 0 0 0 1
2009 Israel 0 7 0 0 0
2008 Israel 1 2 2 0 4
2008 Lithuania 1 5 1 0 8
It is actually a CSV file with a comma as the separator, so the raw data is:
Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
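How can I build a dictionary of lists in which, for each column (the Albania column, say), the first element of the list is the column sum for Lithuania and the second element is the column sum for Israel?
I am a Python beginner and don't know many Python tricks. All I know is that my code is probably too complicated.
I would like to get this:
final_dict = {'Albania': [1, 1], 'Andorra': [5, 9], 'Armenia': [1, 2], 'Austria': [0, 0], 'Azerbaijan': [9, 4]}
Explanation of the output: for each country in the header row (Albania, Andorra, Armenia, Austria and Azerbaijan), I want the sum for each country listed in the Country column.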
Andorra: [5,9]
# 5 is sum for Lithuania in Andorra column
# 9 is sum for Israel in Andorra column
Answer 0 (score: 2)
You can use the Pandas module, which is well suited to this kind of task:
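import pandas as pd
# load the CSV and group the rows by the voting country
df = pd.read_csv('songfestival.csv')
gb = df.groupby('Country')
# sum the numeric columns of each group; after .T there is one row per country
res = pd.concat([i[1].sum(numeric_only=True) for i in gb], axis=1).T
res.pop('Year')  # the summed years are meaningless, so drop them
order = [i[0] for i in gb]  # row i of res belongs to order[i]
print(order)
print(res)
#['Israel', 'Lithuania']
#   Albania  Andorra  Armenia  Austria  Azerbaijan
#0        1        9        2        0           4
#1        1        5        1        0           9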
To query the result for each column, you can do:
print(res.Albania)
print(res.Andorra)
...
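To get from there to the dictionary of lists asked for in the question, something along these lines should work. This is only a sketch: the data is inlined through io.StringIO so that it runs without songfestival.csv, and the names CSV_TEXT, totals and order are mine, not part of the answer above. The order of the list elements (Lithuania first, Israel second) is assumed from the question.

```python
import io

import pandas as pd

# Inline copy of the question's data, so the sketch runs without songfestival.csv.
CSV_TEXT = """Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# One row of totals per voting country; Year is dropped before summing.
totals = df.drop(columns="Year").groupby("Country").sum()

# Assumption: the question wants Lithuania first and Israel second.
# Stating the order explicitly avoids relying on groupby's sorted output.
order = ["Lithuania", "Israel"]
final_dict = {column: totals.loc[order, column].tolist() for column in totals.columns}
print(final_dict)
```

Indexing with totals.loc[order, column] is what pins each list position to a named country, so the result does not depend on how the rows happen to be sorted.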
Answer 1 (score: 1)
OK, so you want the rows aggregated over the years:
import csv
from collections import defaultdict
# newline="" is what the csv module recommends for file objects in Python 3
with open("songfestival.csv", "r", newline="") as ifile:
    reader = csv.DictReader(ifile)
    # every column except Year and Country holds points
    country_columns = [k for k in reader.fieldnames if k not in ["Year", "Country"]]
    # data[voting country][column] -> points summed over all years
    data = defaultdict(lambda: defaultdict(int))
    for line in reader:
        curr_country = data[line["Country"]]
        for country_column in country_columns:
            curr_country[country_column] += int(line[country_column])
with open("songfestival_aggr.csv", "w", newline="") as ofile:
    writer = csv.DictWriter(ofile, fieldnames=country_columns+["Country"])
    writer.writeheader()
    for k, v in data.items():
        row = dict(v)
        row["Country"] = k
        writer.writerow(row)
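I took the liberty of writing the output to another CSV file. Your data structure is very error-prone, because it depends on the order of the columns. It is better to use an intermediate dict of dicts, so that every aggregate is assigned a name; see @gboffi's comment on your question.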
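If you still want the list-based dictionary from the question, it can be derived from such a dict of dicts at the very end, once the named aggregates exist. A minimal sketch, assuming the data is inlined instead of read from songfestival.csv and that the desired order is Lithuania, then Israel; CSV_TEXT, totals and order are illustrative names:

```python
import csv
import io
from collections import defaultdict

# Inline copy of the question's data, so the sketch runs without any file.
CSV_TEXT = """Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
columns = [name for name in reader.fieldnames if name not in ("Year", "Country")]

# totals[voting country][column] -> points summed over all years
totals = defaultdict(lambda: defaultdict(int))
for row in reader:
    for column in columns:
        totals[row["Country"]][column] += int(row[column])

# Assumption: list position 0 is Lithuania and position 1 is Israel.
order = ["Lithuania", "Israel"]
final_dict = {column: [totals[voter][column] for voter in order] for column in columns}
print(final_dict)
```

Because the lists are built from names only in the last step, reordering the rows or columns of the CSV does not change the result.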
Answer 2 (score: 0)
Your best bet is to use the defaultdict from the collections module; please search for
python defaultdict
on SO and you will find plenty of useful examples. Here is my answer:
import csv
from collections import defaultdict
# slurp the data
with open('points.csv', newline='') as f:
    data = list(csv.reader(f))
# massage the data: turn numeric strings into ints
for i, row in enumerate(data[1:], 1):
    data[i] = [int(elt) if elt.isdigit() else elt for elt in row]
points = {} # an empty dictionary
for i, country in enumerate(data[0][2:], 2):
    # for each country, a pair country:defaultdict is put in points
    points[country] = defaultdict(int)
    for row in data[1:]:
        opponent = row[1]
        points[country][opponent] += row[i]
# here you can post-process points as you like,
# I'll simply print out the stuff
for country in points:
    for opponent in points[country]:
        print(country, "vs", opponent, "scored",
              points[country][opponent], "points.")
Sample output for your data (the order of the lines may differ):
Andorra vs Israel scored 9 points.
Andorra vs Lithuania scored 5 points.
Austria vs Israel scored 0 points.
Austria vs Lithuania scored 0 points.
Albania vs Israel scored 1 points.
Albania vs Lithuania scored 1 points.
Azerbaijan vs Israel scored 4 points.
Azerbaijan vs Lithuania scored 9 points.
Armenia vs Israel scored 2 points.
Armenia vs Lithuania scored 1 points.
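Edit
If you object to defaultdict, you can use the .get method of a plain dict, which lets you return an optional default value when a key:value pair has not been initialized yet.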
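The code that belonged to this edit is missing from this copy of the thread, so what follows is a reconstruction of the idea, not the original snippet. It assumes the same layout as the code above, with the data inlined so that it runs without points.csv; CSV_TEXT, header and body are illustrative names.

```python
import csv
import io

# Inline copy of the question's data, so the sketch runs without points.csv.
CSV_TEXT = """Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
"""

rows = list(csv.reader(io.StringIO(CSV_TEXT)))
header, body = rows[0], rows[1:]

points = {}
for i, country in enumerate(header[2:], 2):
    points[country] = {}  # a plain dict instead of defaultdict(int)
    for row in body:
        opponent = row[1]
        # .get returns 0 the first time an opponent is seen,
        # so the missing key never raises a KeyError
        points[country][opponent] = points[country].get(opponent, 0) + int(row[i])

print(points["Andorra"])
```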
As you can see, it is a little clunkier, but still manageable.