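Rewriting a pandas script using only the Python 3 standard library

Time: 2018-10-10 16:20:10

Tags: python python-3.x pandas

This is a script written with pandas that I have to rewrite using only the standard library: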

import pandas as pd
import sys

if __name__ == '__main__':
    if len(sys.argv) != 1 :
        print('usage: python by_continent.py')
        sys.exit(1)

    gap = pd.read_csv('gapminder.tsv', sep='\t')
    means = gap.groupby('continent').mean()
    parts = means[['lifeExp', 'gdpPercap']]
    print( parts )

The input looks like this:

country continent   year    lifeExp pop gdpPercap
Zambia  Africa  2002    39.193  10595811    1071.613938 
Zambia  Africa  2007    42.384  11746035    1271.211593 
Zimbabwe    Africa  1952    48.451  3080907 406.8841148 
Zimbabwe    Africa  1957    50.469  3646340 518.7642681 
Zimbabwe    Africa  1962    52.358  4277736 527.2721818 
Zimbabwe    Africa  1967    53.995  4995432 569.7950712 
Zimbabwe    Africa  1972    55.635  5861135 799.3621758 
Zimbabwe    Africa  1977    57.674  6642107 685.5876821 
Zimbabwe    Africa  1982    60.363  7636524 788.8550411 
Zimbabwe    Africa  1987    62.351  9216418 706.1573059 
Zimbabwe    Africa  1992    60.377  10704340    693.4207856 
Zimbabwe    Africa  1997    46.809  11404948    792.4499603 
Zimbabwe    Africa  2002    39.989  11926563    672.0386227 
Zimbabwe    Africa  2007    43.487  12311143    469.7092981 
Argentina   Americas    1952    62.485  17876956    5911.315053 
Argentina   Americas    1957    64.399  19610538    6856.856212 
Argentina   Americas    1962    65.142  21283783    7133.166023 
Argentina   Americas    1967    65.634  22934225    8052.953021 
Argentina   Americas    1972    67.065  24779799    9443.038526 
Argentina   Americas    1977    68.481  26983828    10079.02674 

And this is the output:

             lifeExp     gdpPercap
continent
Africa     48.865330   2193.754578
Americas   64.658737   7136.110356
Asia       60.064903   7902.150428
Europe     71.903686  14469.475533
Oceania    74.326208  18621.609223

I'm stuck. I can parse the file with the csv module, but I can't get any further. Here is my code:

import sys 
import csv

with open('gapminder.tsv', 'r') as gap:
    csv_reader = csv.reader(gap, delimiter="\t")
    lst = list(csv_reader)

    for row in lst: 
        if row [1] == 'Africa': 
            pop = []
            pop.append(row[4])

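3 answers:

Answer 0: (score: 0)

For the groupby you can just use a dictionary. Populate its keys programmatically, with each value starting out as null and ending up as a list of 3-tuples that hold the column name, the type of calculation, and the value you should return for that combination.

There may be prettier, more elegant ways to do it, but this should work.

Edit: wait, since you have already specified the column (I mean continent) names, they would be 2-tuples. I guess I'm not fully awake yet.

You could also just create an n-dimensional list (a matrix) containing all the relevant values. That would be the sensible approach.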
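One possible reading of the dictionary idea above, as a minimal sketch rather than the answerer's exact design: keep a running sum per continent and column plus a row count, then divide at the end. The function name `continent_means`, the `SAMPLE` string, and the nested-dict layout are assumptions made for illustration; the rows are taken from the question's sample.

```python
import csv
import io
from collections import defaultdict

# A small tab-separated sample in the same layout as gapminder.tsv
# (rows copied from the question; the real file has many more).
SAMPLE = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Zambia\tAfrica\t2002\t39.193\t10595811\t1071.613938\n"
    "Zambia\tAfrica\t2007\t42.384\t11746035\t1271.211593\n"
    "Argentina\tAmericas\t1952\t62.485\t17876956\t5911.315053\n"
    "Argentina\tAmericas\t1957\t64.399\t19610538\t6856.856212\n"
)

def continent_means(lines, columns=("lifeExp", "gdpPercap")):
    """Return {continent: {column: mean}} for the given numeric columns."""
    # sums[continent][column] is a running total; counts[continent] is the row count.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in csv.DictReader(lines, delimiter="\t"):
        continent = row["continent"]
        counts[continent] += 1
        for column in columns:
            sums[continent][column] += float(row[column])
    return {
        continent: {
            column: sums[continent][column] / counts[continent]
            for column in columns
        }
        for continent in sums
    }

means = continent_means(io.StringIO(SAMPLE))
for continent in sorted(means):
    print(continent, means[continent])
```

With the real file, the same function should accept an open file object, for example `with open('gapminder.tsv', newline='') as gap: continent_means(gap)`, since `csv.DictReader` takes any iterable of lines.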

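Answer 1: (score: 0)

The following should get you started (forgive the lazy variable naming). The main idea is to use the existing groupby in itertools to aggregate by whichever field you want, then collect the results from those groups into a dictionary, pulling out the fields you want to average.

P.S. Hopefully this isn't your homework, because that would just be lazy :)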

stuff = """country continent   year    lifeExp pop gdpPercap
Zambia  Africa  2002    39.193  10595811    1071.613938 
Zambia  Africa  2007    42.384  11746035    1271.211593 
Zimbabwe    Africa  1952    48.451  3080907 406.8841148 
Zimbabwe    Africa  1957    50.469  3646340 518.7642681 
Zimbabwe    Africa  1962    52.358  4277736 527.2721818 
Zimbabwe    Africa  1967    53.995  4995432 569.7950712 
Zimbabwe    Africa  1972    55.635  5861135 799.3621758 
Zimbabwe    Africa  1977    57.674  6642107 685.5876821 
Zimbabwe    Africa  1982    60.363  7636524 788.8550411 
Zimbabwe    Africa  1987    62.351  9216418 706.1573059 
Zimbabwe    Africa  1992    60.377  10704340    693.4207856 
Zimbabwe    Africa  1997    46.809  11404948    792.4499603 
Zimbabwe    Africa  2002    39.989  11926563    672.0386227 
Zimbabwe    Africa  2007    43.487  12311143    469.7092981 
Argentina   Americas    1952    62.485  17876956    5911.315053 
Argentina   Americas    1957    64.399  19610538    6856.856212 
Argentina   Americas    1962    65.142  21283783    7133.166023 
Argentina   Americas    1967    65.634  22934225    8052.953021 
Argentina   Americas    1972    67.065  24779799    9443.038526 
Argentina   Americas    1977    68.481  26983828    10079.02674""".strip().splitlines()

import itertools
from collections import defaultdict

stuff = [line.split() for line in stuff]
headers, *records = stuff
labeled_records = [dict(zip(headers,line)) for line in records]

#group by continent
grouped = itertools.groupby(labeled_records,lambda x: x['continent'])
results = defaultdict(list)

# for other categories, just change 'lifeExp' to the column you want 
for k,v in grouped:
    for d in v:
        results[k].append(float(d['lifeExp']))

# average collected results     
for k,v in results.items():
    print(k,'\t',sum(v)/len(v))
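
One detail worth knowing about `itertools.groupby`: it only merges adjacent items that share a key. The code above is unaffected, because it appends every group into a `defaultdict` keyed by continent, but if you want to use each group directly, the records need to be sorted by the same key first. The sketch below shows that variant and averages both columns with `statistics.mean`. The `records` list and the names `by_continent` and `grouped_means` are hypothetical; the values come from the question's sample rows.

```python
import itertools
from statistics import mean

# Sample records, deliberately NOT ordered by continent.
records = [
    {"continent": "Africa", "lifeExp": "39.193", "gdpPercap": "1071.613938"},
    {"continent": "Americas", "lifeExp": "62.485", "gdpPercap": "5911.315053"},
    {"continent": "Africa", "lifeExp": "42.384", "gdpPercap": "1271.211593"},
]

def by_continent(record):
    """Key function shared by sorted() and groupby()."""
    return record["continent"]

def grouped_means(records, columns=("lifeExp", "gdpPercap")):
    """Return {continent: {column: mean}} using sorted() + itertools.groupby."""
    result = {}
    # groupby only merges adjacent items with equal keys, so sort by that key first.
    ordered = sorted(records, key=by_continent)
    for continent, group in itertools.groupby(ordered, key=by_continent):
        rows = list(group)  # the group iterator can be consumed only once
        result[continent] = {
            column: mean(float(row[column]) for row in rows)
            for column in columns
        }
    return result

print(grouped_means(records))
```

Without the `sorted()` call, this version would see two separate "Africa" groups and the second would overwrite the first in `result`.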

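Answer 2: (score: 0)

I don't want to spoil the fun of working out the solution, but you can convert the data into a list of dictionaries, which makes it much easier to manage. To do that, split every sublist into fields, then zip the header row (row 1) with each of the remaining rows so that the header fields become the keys. Finally, use the dict() constructor on each zipped pair to build the list of dictionaries.

With this you can do things like for i in res: print(i['country']), pull out only the dictionaries that belong to a given country, and so on.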

import csv
import pprint

with open('gapminder.tsv', 'r') as gap:
    csv_reader = csv.reader(gap, delimiter="\t")
    lst = list(csv_reader)

lst = [i[0].split() for i in lst]
prep = zip([lst[0]]*len(lst[1:]), lst[1:])
prep = [(zip(i[0], i[1])) for i in prep]
res = [dict([j for j in i]) for i in prep]
pprint.pprint(res)
[{'continent': 'Africa',
  'country': 'Zambia',
  'gdpPercap': '1071.613938',
  'lifeExp': '39.193',
  'pop': '10595811',
  'year': '2002'},
 {'continent': 'Africa',
  'country': 'Zambia',
  'gdpPercap': '1271.211593',
  'lifeExp': '42.384',
  'pop': '11746035',
  'year': '2007'},
...
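
If the file really is tab-separated, as the `.tsv` name and the `sep='\t'` in the question suggest, `csv.DictReader` can build the same list of dictionaries without the manual split and zip steps. It also avoids splitting on spaces inside a field, which a bare `str.split()` would do. This is a minimal sketch under that assumption; `SAMPLE_TSV` and `load_records` are hypothetical names, and the rows are copied from the question.

```python
import csv
import io

# Tab-separated sample; with the real file you would pass
# open('gapminder.tsv', newline='') instead of the StringIO object.
SAMPLE_TSV = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Zambia\tAfrica\t2002\t39.193\t10595811\t1071.613938\n"
    "Zambia\tAfrica\t2007\t42.384\t11746035\t1271.211593\n"
    "Zimbabwe\tAfrica\t1952\t48.451\t3080907\t406.8841148\n"
)

def load_records(lines):
    """Return one dict per data row, keyed by the header fields."""
    return list(csv.DictReader(lines, delimiter="\t"))

res = load_records(io.StringIO(SAMPLE_TSV))

# Keep only the dictionaries for one country.
zambia = [row for row in res if row["country"] == "Zambia"]
print([row["year"] for row in zambia])  # prints ['2002', '2007']
```

All values are still strings at this point, so convert with `float()` before averaging.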