Here is a script written with pandas that I have to rewrite using only the standard library:
import pandas as pd
import sys

if __name__ == '__main__':
    if len(sys.argv) != 1:
        print('usage: python by_continent.py')
        sys.exit(1)
    gap = pd.read_csv('gapminder.tsv', sep='\t')
    means = gap.groupby('continent').mean()
    parts = means[['lifeExp', 'gdpPercap']]
    print(parts)
The input looks like this:
country continent year lifeExp pop gdpPercap
Zambia Africa 2002 39.193 10595811 1071.613938
Zambia Africa 2007 42.384 11746035 1271.211593
Zimbabwe Africa 1952 48.451 3080907 406.8841148
Zimbabwe Africa 1957 50.469 3646340 518.7642681
Zimbabwe Africa 1962 52.358 4277736 527.2721818
Zimbabwe Africa 1967 53.995 4995432 569.7950712
Zimbabwe Africa 1972 55.635 5861135 799.3621758
Zimbabwe Africa 1977 57.674 6642107 685.5876821
Zimbabwe Africa 1982 60.363 7636524 788.8550411
Zimbabwe Africa 1987 62.351 9216418 706.1573059
Zimbabwe Africa 1992 60.377 10704340 693.4207856
Zimbabwe Africa 1997 46.809 11404948 792.4499603
Zimbabwe Africa 2002 39.989 11926563 672.0386227
Zimbabwe Africa 2007 43.487 12311143 469.7092981
Argentina Americas 1952 62.485 17876956 5911.315053
Argentina Americas 1957 64.399 19610538 6856.856212
Argentina Americas 1962 65.142 21283783 7133.166023
Argentina Americas 1967 65.634 22934225 8052.953021
Argentina Americas 1972 67.065 24779799 9443.038526
Argentina Americas 1977 68.481 26983828 10079.02674
And here is the output:
             lifeExp     gdpPercap
continent
Africa     48.865330   2193.754578
Americas   64.658737   7136.110356
Asia       60.064903   7902.150428
Europe     71.903686  14469.475533
Oceania    74.326208  18621.609223
I'm stuck. I can parse the file with the csv module, but I can't get any further. This is my code:
import sys
import csv

with open('gapminder.tsv', 'r') as gap:
    csv_reader = csv.reader(gap, delimiter="\t")
    lst = list(csv_reader)
    for row in lst:
        if row[1] == 'Africa':
            pop = []
            pop.append(row[4])
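Answer 0 (score: 0)
For the groupby, you can just use a dictionary. Populate its keys programmatically, with values that start out empty and end up as lists of 3-tuples; each tuple would hold the column name, the type of calculation, and the value you should return for that combination.
There may be prettier, more elegant ways, but this should work.
Edit: wait, since you have already specified the column (I mean continent) names, they would be 2-tuples. I guess I'm not thinking straight.
You could also just build a single n-dimensional list (a matrix) holding all the relevant values. That would be the sensible approach.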
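One possible reading of the dictionary idea above, as a minimal sketch rather than the answerer's exact design: it keeps a running total per continent and column in nested dicts instead of tuples. The names `SAMPLE` and `mean_by_group` are made up for illustration, and the sample holds only three rows from the question, assumed to be tab-separated.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for gapminder.tsv: three rows from the question.
SAMPLE = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Zambia\tAfrica\t2002\t39.193\t10595811\t1071.613938\n"
    "Zambia\tAfrica\t2007\t42.384\t11746035\t1271.211593\n"
    "Argentina\tAmericas\t1952\t62.485\t17876956\t5911.315053\n"
)


def mean_by_group(lines, key, columns):
    """Return {group: {column: mean}} for the given numeric columns."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    key_idx = header.index(key)
    col_idx = {c: header.index(c) for c in columns}

    # One running total per (group, column) and one row count per group.
    totals = defaultdict(lambda: dict.fromkeys(columns, 0.0))
    counts = defaultdict(int)
    for row in reader:
        group = row[key_idx]
        counts[group] += 1
        for col, i in col_idx.items():
            totals[group][col] += float(row[i])

    return {g: {c: totals[g][c] / counts[g] for c in columns} for g in totals}


means = mean_by_group(io.StringIO(SAMPLE), "continent", ["lifeExp", "gdpPercap"])
for continent in sorted(means):
    print(continent, means[continent]["lifeExp"], means[continent]["gdpPercap"])
```

To run it against the real file, passing the open file object in place of `io.StringIO(SAMPLE)` should work, since `csv.reader` accepts any iterable of lines.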
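Answer 1 (score: 0)
The following should get you started (please excuse the lazy variable naming). The main idea is to use the existing groupby from itertools to aggregate by whichever field you like, then collect the results from those groups into a dictionary, pulling out the field you want to average.
P.S. I hope this isn't your homework, because that would just be lazy :)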
stuff = """country continent year lifeExp pop gdpPercap
Zambia Africa 2002 39.193 10595811 1071.613938
Zambia Africa 2007 42.384 11746035 1271.211593
Zimbabwe Africa 1952 48.451 3080907 406.8841148
Zimbabwe Africa 1957 50.469 3646340 518.7642681
Zimbabwe Africa 1962 52.358 4277736 527.2721818
Zimbabwe Africa 1967 53.995 4995432 569.7950712
Zimbabwe Africa 1972 55.635 5861135 799.3621758
Zimbabwe Africa 1977 57.674 6642107 685.5876821
Zimbabwe Africa 1982 60.363 7636524 788.8550411
Zimbabwe Africa 1987 62.351 9216418 706.1573059
Zimbabwe Africa 1992 60.377 10704340 693.4207856
Zimbabwe Africa 1997 46.809 11404948 792.4499603
Zimbabwe Africa 2002 39.989 11926563 672.0386227
Zimbabwe Africa 2007 43.487 12311143 469.7092981
Argentina Americas 1952 62.485 17876956 5911.315053
Argentina Americas 1957 64.399 19610538 6856.856212
Argentina Americas 1962 65.142 21283783 7133.166023
Argentina Americas 1967 65.634 22934225 8052.953021
Argentina Americas 1972 67.065 24779799 9443.038526
Argentina Americas 1977 68.481 26983828 10079.02674""".strip().splitlines()
import itertools
from collections import defaultdict

stuff = [line.split() for line in stuff]
headers, *records = stuff
labeled_records = [dict(zip(headers, line)) for line in records]

# group by continent
grouped = itertools.groupby(labeled_records, lambda x: x['continent'])

results = defaultdict(list)
# for other categories, just change 'lifeExp' to the column you want
for k, v in grouped:
    for d in v:
        results[k].append(float(d['lifeExp']))

# average collected results
for k, v in results.items():
    print(k, '\t', sum(v) / len(v))
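One thing to keep in mind with this approach: itertools.groupby only merges adjacent items that share a key, so if you want exactly one group per continent, the records have to be sorted by that key first. Here is a small sketch of that variation, which also averages both columns from the question. The `records` list and the `continent_of` helper are invented for illustration, and the rows are deliberately out of order.

```python
import itertools
from statistics import mean

# Hypothetical records, deliberately not ordered by continent.
records = [
    {"continent": "Africa", "lifeExp": "39.193", "gdpPercap": "1071.613938"},
    {"continent": "Americas", "lifeExp": "62.485", "gdpPercap": "5911.315053"},
    {"continent": "Africa", "lifeExp": "42.384", "gdpPercap": "1271.211593"},
]


def continent_of(record):
    return record["continent"]


averages = {}
# Sort by the grouping key first so each continent forms one contiguous run.
ordered = sorted(records, key=continent_of)
for continent, group in itertools.groupby(ordered, key=continent_of):
    rows = list(group)  # the group iterator can only be consumed once
    averages[continent] = {
        col: mean(float(r[col]) for r in rows)
        for col in ("lifeExp", "gdpPercap")
    }

for continent, cols in averages.items():
    print(continent, cols["lifeExp"], cols["gdpPercap"])
```

Without the sort, the two Africa rows would come out as two separate groups, and the second would overwrite the first in `averages`.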
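Answer 2 (score: 0)
I don't want to spoil the fun of working out the calculation, but you can convert the data into a list of dictionaries, which makes it much easier to manage. To do that, split every sub-list, then pair the first row (the header, which supplies our keys) with each of the remaining rows and zip the keys with that row's values. After that, we use the dict() constructor to create the list of dictionaries.
With this, you can do things like for i in res: print(i['country']), pull out only the dictionaries that belong to a certain country, and so on.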
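import csv
import pprint

with open('gapminder.tsv', 'r') as gap:
    csv_reader = csv.reader(gap, delimiter="\t")
    lst = list(csv_reader)

# Each row arrives as a single string here, so split it on whitespace.
# (If your file really is tab-separated, csv.reader has already split the
# fields and this line should be dropped.)
lst = [i[0].split() for i in lst]
# Pair the header row with every data row ...
prep = zip([lst[0]] * len(lst[1:]), lst[1:])
# ... then zip each pair into (key, value) tuples
prep = [(zip(i[0], i[1])) for i in prep]
# Build one dictionary per data row
res = [dict([j for j in i]) for i in prep]
pprint.pprint(res)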
[{'continent': 'Africa', 'country': 'Zambia', 'gdpPercap': '1071.613938', 'lifeExp': '39.193', 'pop': '10595811', 'year': '2002'}, {'continent': 'Africa', 'country': 'Zambia', 'gdpPercap': '1271.211593', 'lifeExp': '42.384', 'pop': '11746035', 'year': '2007'}, ...
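As an aside, the standard library's csv.DictReader builds the same kind of list of dictionaries directly, using the header row as keys. A minimal sketch, assuming the file is tab-separated; `SAMPLE` is a made-up three-row stand-in for gapminder.tsv:

```python
import csv
import io

# Hypothetical stand-in for gapminder.tsv (tab-separated).
SAMPLE = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Zambia\tAfrica\t2002\t39.193\t10595811\t1071.613938\n"
    "Zambia\tAfrica\t2007\t42.384\t11746035\t1271.211593\n"
    "Argentina\tAmericas\t1952\t62.485\t17876956\t5911.315053\n"
)

# DictReader takes its keys from the first row and yields one mapping per record.
res = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

# Keep only the dictionaries that belong to one country.
zambia = [row for row in res if row["country"] == "Zambia"]
for row in zambia:
    print(row["year"], row["lifeExp"])
```

All values come back as strings, so convert them with float() or int() before doing any arithmetic.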