如何修改CSV文件中的数据并更改行和列?

时间:2015-06-17 17:51:20

标签: python csv pandas

我有一个CSV文件,我的数据格式如下:

Countries  variable  1995   1996  1997  1998   1999
  USA        GDP      10     11    12    12     13 
  USA        Inf      100    120   130   120    110
  USA        Trade    200    220   210   235    250
  GER        GDP      8      9     9.5   10     10.5
  GER        Inf      100    105   107   109    111 
  GER        Trade    150    156   149   165    167 

我打算修改我的数据并将其更改为:

  Countries  Years    GDP    Inf   Trade
     USA      1995     10    100    200
     USA      1996     11    120    220
     USA      1997     12    130    210
     USA      1998     12    120    235
     USA      1999     13    110    250
     GER      1995     8     100    150
     GER      1996     9     105    156
     GER      1997     9.5   107    149
     GER      1998     10    109    165
     GER      1999     10.5  111    167

我不知道我在Python中如何做到这一点。我已经在pandas中导入了我的数据,但是对数据唯一可以做的就是转换不是我想要的列和行。另外,使用csv.writerow我无法修改我的数据。

4 个答案:

答案 0 :(得分:3)

您可以使用Ordereddict对数据进行分组:

import csv
from collections import OrderedDict,defaultdict
from itertools import islice
with open("out.csv") as f:
    od = OrderedDict()
    r = csv.reader(f, delimiter=" ")
    header = next(r)
    years = header[2:]
    zipped = zip(*r)
    countries = OrderedDict.fromkeys(zipped[0]).keys() # next(zipped) python3
    it = iter(countries)
    for  row in zip(*zipped[1:]): #  for  row in zip(*zipped) python3
        if row[0] == "GDP":
            key = next(it)
            od.setdefault(key, defaultdict(list))
            od[key]["Years"] = years
            od[key]["Country"] = [key] * len(years)
        od[key][row[0]].extend(islice(row,1,None))

输出:

OrderedDict([('USA', defaultdict(<type 'list'>, {'GDP': ['10', '11', '12', '12', '13'], 'Inf': ['100', '120', '130', '120', '110'], 'Years': ['1995', '1996', '1997', '1998', '1999'], 'Trade': ['200', '220', '210', '235', '250']})), ('GER', defaultdict(<type 'list'>, {'GDP': ['8', '9', '9.5', '10', '10.5'], 'Inf': ['100', '105', '107', '109', '111'], 'Years': ['1995', '1996', '1997', '1998', '1999'], 'Trade': ['150', '156', '149', '165', '167']}))])

在熊猫中更有经验的人将能够更好地做到这一点,但这至少会创建一个数据帧:

df = pd.DataFrame(columns=["Country","Years","GDP","Inf","Trade"])

for k,v in od.items():
    df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ),["Country","Years","GDP","Inf","Trade"]).transpose()
    f = df.append(df_temp,ignore_index=True)
print(df)

输出:

    Country Years   GDP  Inf Trade
0     USA  1995    10  100   200
1     USA  1996    11  120   220
2     USA  1997    12  130   210
3     USA  1998    12  120   235
4     USA  1999    13  110   250
5     GER  1995     8  100   150
6     GER  1996     9  105   156
7     GER  1997   9.5  107   149
8     GER  1998    10  109   165
9     GER  1999  10.5  111   167

如果你的文件较大,你也可以随时创建数据框并每次重置OrderedDict以避免在dict中存储所有数据,你只需要在主代码之外追加最后一组,我们也可以如果使用python2,使用itertools.islice获取所有切片并使用itertools.izip进行压缩:

import csv
from collections import OrderedDict,defaultdict
from itertools import islice,izip

df = pd.DataFrame(columns=["Country","Years","GDP","Inf","Trade"])
with open("out.csv") as f:
    od = OrderedDict()
    r = csv.reader(f, delimiter=" ")
    header = next(r)
    years = header[2:]
    zipped = izip(*r)
    countries = OrderedDict.fromkeys(next(zipped)).keys()
    it = iter(countries)
    for row in izip(*zipped):
        if row[0] == "GDP":
            if od: # make sure it is not the first line
                for k, v in od.items():
                    df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ), ["Country","Years","GDP","Inf","Trade"]).transpose()
                    df = df.append(df_temp, ignore_index=True)
                    od = OrderedDict()
            key = next(it)
            od.setdefault(key, defaultdict(list))
            od[key]["Years"] = years
            od[key]["Country"] = [key] * len(years)
        od[key][row[0]].extend(islice(row, 1, None))

for k,v in od.items():
        df_temp = pd.DataFrame((v[k] for k in ["Country","Years","GDP","Inf","Trade"] ), ["Country","Years","GDP","Inf","Trade"]).transpose()
        df = df.append(df_temp, ignore_index=True)


print(df)

哪个应该再次给出相同的输出:

  Country Years   GDP  Inf Trade
0     USA  1995    10  100   200
1     USA  1996    11  120   220
2     USA  1997    12  130   210
3     USA  1998    12  120   235
4     USA  1999    13  110   250
5     GER  1995     8  100   150
6     GER  1996     9  105   156
7     GER  1997   9.5  107   149
8     GER  1998    10  109   165
9     GER  1999  10.5  111   167

答案 1 :(得分:2)

假设您将数据放在列表列表中:

>>> for line in data:
...     print('\t'.join(line))
...
USA  GDP    10   11   12   12   13
USA  Inf    100  120  130  120  110
USA  Trade  200  220  210  235  250
GER  GDP    8    9    9.5  10   10.5
GER  Inf    100  105  107  109  111
GER  Trade  150  156  149  165  167

使用以下代码:

from collections import defaultdict
data2 = defaultdict(dict)
for line in data:
    for i, year in ((2,1995),(3,1996),(4,1997),(5,1998),(6,1999)):
        data2[(line[0], year)][line[1]] = line[i]
data3 = [[i,j]+[data2[(i,j)][k] for k in ('GDP','Inf','Trade')] for i,j in data2]
for line in sorted(data3):
    print(line)

你得到:

['GER', 1995, '8', '100', '150']
['GER', 1996, '9', '105', '156']
['GER', 1997, '9.5', '107', '149']
['GER', 1998, '10', '109', '165']
['GER', 1999, '10.5', '111', '167']
['USA', 1995, '10', '100', '200']
['USA', 1996, '11', '120', '220']
['USA', 1997, '12', '130', '210']
['USA', 1998, '12', '120', '235']
['USA', 1999, '13', '110', '250']

答案 2 :(得分:2)

这个答案很像@ AmiTavory删除的答案(使用unstack代替pivot_table,但它们在这里相当),最后还有一步:

df2 = pd.melt(df, id_vars=["Countries", "variable"], var_name="Years")
df2 = df2.set_index(["Countries", "Years", "variable"]).unstack().reset_index()
df2.columns = [x[1] if x[1] else x[0] for x in df2.columns]

产生

In [149]: df2
Out[149]: 
  Countries Years   GDP  Inf  Trade
0       GER  1995   8.0  100    150
1       GER  1996   9.0  105    156
2       GER  1997   9.5  107    149
3       GER  1998  10.0  109    165
4       GER  1999  10.5  111    167
5       USA  1995  10.0  100    200
6       USA  1996  11.0  120    220
7       USA  1997  12.0  130    210
8       USA  1998  12.0  120    235
9       USA  1999  13.0  110    250

这很有效,因为首先我们创建了一个融合版本的框架:

In [160]: df2 = pd.melt(df, id_vars=["Countries", "variable"], var_name="Years")

In [161]: df2
Out[161]: 
   Countries variable Years  value
0        USA      GDP  1995   10.0
1        USA      Inf  1995  100.0
2        USA    Trade  1995  200.0
3        GER      GDP  1995    8.0
4        GER      Inf  1995  100.0
5        GER    Trade  1995  150.0
6        USA      GDP  1996   11.0
[...]

然后我们设置索引并取消堆栈:

In [166]: df2 = df2.set_index(["Countries", "Years", "variable"]).unstack().reset_index()

In [167]: df2
Out[167]: 
         Countries Years value           
variable                   GDP  Inf Trade
0              GER  1995   8.0  100   150
1              GER  1996   9.0  105   156
2              GER  1997   9.5  107   149
3              GER  1998  10.0  109   165
4              GER  1999  10.5  111   167
5              USA  1995  10.0  100   200
6              USA  1996  11.0  120   220
7              USA  1997  12.0  130   210
8              USA  1998  12.0  120   235
9              USA  1999  13.0  110   250

这几乎是我们想要的,但是列太复杂了。我们可以解决它:

In [168]: df2.columns
Out[168]: 
MultiIndex(levels=[['value', 'Years', 'Countries'], ['GDP', 'Inf', 'Trade', '']],
           labels=[[2, 1, 0, 0, 0], [3, 3, 0, 1, 2]],
           names=[None, 'variable'])

In [169]: df2.columns = [x[1] if x[1] else x[0] for x in df2.columns]

In [170]: df2
Out[170]: 
  Countries Years   GDP  Inf  Trade
0       GER  1995   8.0  100    150
1       GER  1996   9.0  105    156
2       GER  1997   9.5  107    149
3       GER  1998  10.0  109    165
4       GER  1999  10.5  111    167
5       USA  1995  10.0  100    200
6       USA  1996  11.0  120    220
7       USA  1997  12.0  130    210
8       USA  1998  12.0  120    235
9       USA  1999  13.0  110    250

答案 3 :(得分:1)

我将您的数据复制粘贴到电子表格中。也许重命名不是必需的,但我认为列名variable产生了错误。另外,我没有检查这是否是RAM消耗最少的方法。

import pandas as pd
import numpy as np
df = pd.read_excel('df_countries.xls','Sheet1')
df.columns=['countries','var','1995','1996','1997','1998','1999']
df_new  = pd.melt(df,id_vars=['countries','var'])
df_new.columns = ['countries','var','year','data']
df_new.set_index(['countries','year','var']).unstack('var')

enter image description here