我有一个CSV文件,我的数据格式如下:
Countries variable 1995 1996 1997 1998 1999
USA GDP 10 11 12 12 13
USA Inf 100 120 130 120 110
USA Trade 200 220 210 235 250
GER GDP 8 9 9.5 10 10.5
GER Inf 100 105 107 109 111
GER Trade 150 156 149 165 167
我打算修改我的数据并将其更改为:
Countries Years GDP Inf Trade
USA 1995 10 100 200
USA 1996 11 120 220
USA 1997 12 130 210
USA 1998 12 120 235
USA 1999 13 110 250
GER 1995 8 100 150
GER 1996 9 105 156
GER 1997 9.5 107 149
GER 1998 10 109 165
GER 1999 10.5 111 167
我不知道我在Python中如何做到这一点。我已经在pandas中导入了我的数据,但是对数据唯一可以做的就是转换不是我想要的列和行。另外,使用csv.writerow
我无法修改我的数据。
答案 0 :(得分:3)
您可以使用Ordereddict对数据进行分组:
import csv
from collections import OrderedDict,defaultdict
from itertools import islice
with open("out.csv") as f:
od = OrderedDict()
r = csv.reader(f, delimiter=" ")
header = next(r)
years = header[2:]
zipped = zip(*r)
countries = OrderedDict.fromkeys(zipped[0]).keys() # next(zipped) python3
it = iter(countries)
for row in zip(*zipped[1:]): # for row in zip(*zipped) python3
if row[0] == "GDP":
key = next(it)
od.setdefault(key, defaultdict(list))
od[key]["Years"] = years
od[key]["Country"] = [key] * len(years)
od[key][row[0]].extend(islice(row,1,None))
输出:
OrderedDict([('USA', defaultdict(<type 'list'>, {'GDP': ['10', '11', '12', '12', '13'], 'Inf': ['100', '120', '130', '120', '110'], 'Years': ['1995', '1996', '1997', '1998', '1999'], 'Trade': ['200', '220', '210', '235', '250']})), ('GER', defaultdict(<type 'list'>, {'GDP': ['8', '9', '9.5', '10', '10.5'], 'Inf': ['100', '105', '107', '109', '111'], 'Years': ['1995', '1996', '1997', '1998', '1999'], 'Trade': ['150', '156', '149', '165', '167']}))])
在熊猫中更有经验的人将能够更好地做到这一点,但这至少会创建一个数据帧:
df = pd.DataFrame(columns=["Country","Years","GDP","Inf","Trade"])
for k,v in od.items():
df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ),["Country","Years","GDP","Inf","Trade"]).transpose()
f = df.append(df_temp,ignore_index=True)
print(df)
输出:
Country Years GDP Inf Trade
0 USA 1995 10 100 200
1 USA 1996 11 120 220
2 USA 1997 12 130 210
3 USA 1998 12 120 235
4 USA 1999 13 110 250
5 GER 1995 8 100 150
6 GER 1996 9 105 156
7 GER 1997 9.5 107 149
8 GER 1998 10 109 165
9 GER 1999 10.5 111 167
如果你的文件较大,你也可以随时创建数据框并每次重置OrderedDict以避免在dict中存储所有数据,你只需要在主代码之外追加最后一组,我们也可以如果使用python2,使用itertools.islice获取所有切片并使用itertools.izip进行压缩:
import csv
from collections import OrderedDict,defaultdict
from itertools import islice,izip
df = pd.DataFrame(columns=["Country","Years","GDP","Inf","Trade"])
with open("out.csv") as f:
od = OrderedDict()
r = csv.reader(f, delimiter=" ")
header = next(r)
years = header[2:]
zipped = izip(*r)
countries = OrderedDict.fromkeys(next(zipped)).keys()
it = iter(countries)
for row in izip(*zipped):
if row[0] == "GDP":
if od: # make sure it is not the first line
for k, v in od.items():
df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ), ["Country","Years","GDP","Inf","Trade"]).transpose()
df = df.append(df_temp, ignore_index=True)
od = OrderedDict()
key = next(it)
od.setdefault(key, defaultdict(list))
od[key]["Years"] = years
od[key]["Country"] = [key] * len(years)
od[key][row[0]].extend(islice(row, 1, None))
for k,v in od.items():
df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ), ["Country","Years","GDP","Inf","Trade"]).transpose()
df = df.append(df_temp, ignore_index=True)
print(df)
哪个应该再次给出相同的输出:
Country Years GDP Inf Trade
0 USA 1995 10 100 200
1 USA 1996 11 120 220
2 USA 1997 12 130 210
3 USA 1998 12 120 235
4 USA 1999 13 110 250
5 GER 1995 8 100 150
6 GER 1996 9 105 156
7 GER 1997 9.5 107 149
8 GER 1998 10 109 165
9 GER 1999 10.5 111 167
答案 1 :(得分:2)
假设您将数据放在列表列表中:
>>> for line in data:
... print('\t'.join(line))
...
USA GDP 10 11 12 12 13
USA Inf 100 120 130 120 110
USA Trade 200 220 210 235 250
GER GDP 8 9 9.5 10 10.5
GER Inf 100 105 107 109 111
GER Trade 150 156 149 165 167
使用以下代码:
from collections import defaultdict
data2 = defaultdict(dict)
for line in data:
for i, year in ((2,1995),(3,1996),(4,1997),(5,1998),(6,1999)):
data2[(line[0], year)][line[1]] = line[i]
data3 = [[i,j]+[data2[(i,j)][k] for k in ('GDP','Inf','Trade')] for i,j in data2]
for line in sorted(data3):
print(line)
你得到:
['GER', 1995, '8', '100', '150']
['GER', 1996, '9', '105', '156']
['GER', 1997, '9.5', '107', '149']
['GER', 1998, '10', '109', '165']
['GER', 1999, '10.5', '111', '167']
['USA', 1995, '10', '100', '200']
['USA', 1996, '11', '120', '220']
['USA', 1997, '12', '130', '210']
['USA', 1998, '12', '120', '235']
['USA', 1999, '13', '110', '250']
答案 2 :(得分:2)
这个答案很像@ AmiTavory删除的答案(使用unstack
代替pivot_table
,但它们在这里相当),最后还有一步:
df2 = pd.melt(df, id_vars=["Countries", "variable"], var_name="Years")
df2 = df2.set_index(["Countries", "Years", "variable"]).unstack().reset_index()
df2.columns = [x[1] if x[1] else x[0] for x in df2.columns]
产生
In [149]: df2
Out[149]:
Countries Years GDP Inf Trade
0 GER 1995 8.0 100 150
1 GER 1996 9.0 105 156
2 GER 1997 9.5 107 149
3 GER 1998 10.0 109 165
4 GER 1999 10.5 111 167
5 USA 1995 10.0 100 200
6 USA 1996 11.0 120 220
7 USA 1997 12.0 130 210
8 USA 1998 12.0 120 235
9 USA 1999 13.0 110 250
这很有效,因为首先我们创建了一个融合版本的框架:
In [160]: df2 = pd.melt(df, id_vars=["Countries", "variable"], var_name="Years")
In [161]: df2
Out[161]:
Countries variable Years value
0 USA GDP 1995 10.0
1 USA Inf 1995 100.0
2 USA Trade 1995 200.0
3 GER GDP 1995 8.0
4 GER Inf 1995 100.0
5 GER Trade 1995 150.0
6 USA GDP 1996 11.0
[...]
然后我们设置索引并取消堆栈:
In [166]: df2 = df2.set_index(["Countries", "Years", "variable"]).unstack().reset_index()
In [167]: df2
Out[167]:
Countries Years value
variable GDP Inf Trade
0 GER 1995 8.0 100 150
1 GER 1996 9.0 105 156
2 GER 1997 9.5 107 149
3 GER 1998 10.0 109 165
4 GER 1999 10.5 111 167
5 USA 1995 10.0 100 200
6 USA 1996 11.0 120 220
7 USA 1997 12.0 130 210
8 USA 1998 12.0 120 235
9 USA 1999 13.0 110 250
这几乎是我们想要的,但是列太复杂了。我们可以解决它:
In [168]: df2.columns
Out[168]:
MultiIndex(levels=[['value', 'Years', 'Countries'], ['GDP', 'Inf', 'Trade', '']],
labels=[[2, 1, 0, 0, 0], [3, 3, 0, 1, 2]],
names=[None, 'variable'])
In [169]: df2.columns = [x[1] if x[1] else x[0] for x in df2.columns]
In [170]: df2
Out[170]:
Countries Years GDP Inf Trade
0 GER 1995 8.0 100 150
1 GER 1996 9.0 105 156
2 GER 1997 9.5 107 149
3 GER 1998 10.0 109 165
4 GER 1999 10.5 111 167
5 USA 1995 10.0 100 200
6 USA 1996 11.0 120 220
7 USA 1997 12.0 130 210
8 USA 1998 12.0 120 235
9 USA 1999 13.0 110 250
答案 3 :(得分:1)
我将您的数据复制粘贴到电子表格中。也许重命名不是必需的,但我认为列名variable
产生了错误。另外,我没有检查这是否是RAM消耗最少的方法。
import pandas as pd
import numpy as np
df = pd.read_excel('df_countries.xls','Sheet1')
df.columns=['countries','var','1995','1996','1997','1998','1999']
df_new = pd.melt(df,id_vars=['countries','var'])
df_new.columns = ['countries','var','year','data']
df_new.set_index(['countries','year','var']).unstack('var')