How to row-bind repeated columns based on a column value?

Time: 2019-07-04 06:22:43

Tags: python-3.x pandas dataframe

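Basically, I am trying to fix an Excel spreadsheet that was copy-pasted from a pivot table.

I had to do some preprocessing to get rid of the NaN values. The dataset looks like this.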

                                          0                 1
0                                     Region       Banyule (C)
2                 None (includes bedsitters)            78/0.2
3                                  1 bedroom          1287/2.9
4                                 2 bedrooms         8457/19.4
5                                 3 bedrooms          21865/50
6                         4 or more bedrooms          11366/26
7              Number of bedrooms not stated           645/1.5
9    Average number of bedrooms per dwelling            3.1/--
10    Average number of people per household            2.6/--
11                                    Region       Bayside (C)
13                None (includes bedsitters)            97/0.3
14                                 1 bedroom          1054/3.2
15                                2 bedrooms         7939/23.9
16                                3 bedrooms        13731/41.3
17                        4 or more bedrooms        10031/30.1
18             Number of bedrooms not stated           419/1.3
20   Average number of bedrooms per dwelling            3.1/--
21    Average number of people per household            2.6/--

I transposed it here: tr = r_2011.T

What I get is

 Region  Average number of people per household Region     Average number of people per household
 Banyule (C)                          2.7/--    Bayside(C)    2.6/--

However, I want the dataset arranged in this structure.

Region       None (includes bedsitters) 1 bedroom 2 bedrooms 3 bedrooms 4 or more bedrooms
Banyule (C)  78/0.2                     1287/2.9  8457/19.4  21865/50  11366/26
Bayside (C)  97/0.3                     1054/3.2  7939/23.9  13731/41.3  10031/30.1

I can't figure out whether the pivot or melt method would solve this.

Here is a link to the file (https://drive.google.com/open?id=18p0qPiqOaPF1d8NgVVB_qIYNV_HbtXQo)

1 answer:

Answer 0: (score: 0)

You can use the following code:

# create an auxiliary column "Region" from the rows with
# label "Region" and forward fill it for all rows
df['Region'] = df['1'].where(df['0'] == 'Region', None)
# assign the result back: an inplace ffill on the selected column
# may not update df in recent pandas versions
df['Region'] = df['Region'].ffill()
# add the original label and the region to the index
# and unstack it to align the attributes of the regions
df.set_index(['0', 'Region'], inplace=True)
df.unstack()

The output is:

Region                                   Banyule (C)  Bayside (C)
0                                                                
1 bedroom                                   1287/2.9     1054/3.2
2 bedrooms                                 8457/19.4    7939/23.9
3 bedrooms                                  21865/50   13731/41.3
4 or more bedrooms                          11366/26   10031/30.1
Average number of bedrooms per dwelling       3.1/--       3.1/--
Average number of people per household        2.6/--       2.6/--
None (includes bedsitters)                    78/0.2       97/0.3
Number of bedrooms not stated                645/1.5      419/1.3
Region                                   Banyule (C)  Bayside (C)

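I built the dataframe I used from the output shown in the question, as follows. So 0 is the column that holds labels such as "Region", and 1 is the column that holds the corresponding values such as "Banyule (C)":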

csv=""";0;1
0;Region;Banyule (C)
2;None (includes bedsitters);78/0.2
3;1 bedroom;1287/2.9
4;2 bedrooms;8457/19.4
5;3 bedrooms;21865/50
6;4 or more bedrooms;11366/26
7;Number of bedrooms not stated;645/1.5
9;Average number of bedrooms per dwelling;3.1/--
10;Average number of people per household;2.6/--
11;Region;Bayside (C)
13;None (includes bedsitters);97/0.3
14;1 bedroom;1054/3.2
15;2 bedrooms;7939/23.9
16;3 bedrooms;13731/41.3
17;4 or more bedrooms;10031/30.1
18;Number of bedrooms not stated;419/1.3
20;Average number of bedrooms per dwelling;3.1/--
21;Average number of people per household;2.6/--"""

import io
import pandas as pd

sb=io.StringIO(csv)
df= pd.read_csv(sb, index_col=0, sep=';')
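The question asks for one row per region with the attributes as columns, and wonders whether pivot could do it. Below is a minimal sketch of that variant, not part of the original answer. It assumes every attribute occurs at most once per region and uses a shortened copy of the sample data; the names `body` and `wide` are only illustrative.

```python
import io
import pandas as pd

# Shortened copy of the sample data (two regions, three attributes each).
csv = """;0;1
0;Region;Banyule (C)
2;None (includes bedsitters);78/0.2
3;1 bedroom;1287/2.9
4;2 bedrooms;8457/19.4
11;Region;Bayside (C)
13;None (includes bedsitters);97/0.3
14;1 bedroom;1054/3.2
15;2 bedrooms;7939/23.9"""

df = pd.read_csv(io.StringIO(csv), index_col=0, sep=";")

# Tag each row with its region: keep the value only on "Region" rows,
# then forward fill down to the next "Region" row.
df["Region"] = df["1"].where(df["0"] == "Region").ffill()

# Drop the "Region" header rows, then pivot so that regions become rows
# and attributes become columns (assumes unique attribute/region pairs).
body = df[df["0"] != "Region"]
wide = body.pivot(index="Region", columns="0", values="1")

# pivot sorts the columns; restore the order of first appearance.
wide = wide[body["0"].drop_duplicates().tolist()]
wide.columns.name = None
print(wide)
```

If an attribute can repeat within a region, pivot raises an error for the same reason unstack does, so the auxiliary counter column is still needed in that case.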

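The unstack code at the top of this answer assumes that your data is properly pre-aggregated and that the index being set is unique. If it is not unique, you can add an auxiliary "Num" column to make it unique. That looks like this: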

sb=io.StringIO(csv)
df= pd.read_csv(sb, index_col=0, sep=';')

# just rename the columns to have meaningful names
df.columns= pd.Index(['Attribute', 'Value'])

# add the region info in a separate column
df['Region']= df['Value'].where(df['Attribute'] == 'Region', None)
# assign the result back instead of an inplace ffill on the column
df['Region']= df['Region'].ffill()

# now create an auxiliary Num column that allows us to create
# a unique index based on Attribute, Region and Num
df['Num']= df.groupby(['Attribute', 'Region']).cumcount()+1

# filter out the rows with the Region
df=df[df['Attribute'] != 'Region']

# set the index for unstack, that is the
# columns in the final index + the index used for the
# pivot-like unstack operation
df= df.set_index(['Attribute', 'Num', 'Region'])
df.unstack(['Region'])

To test it, you can append a few rows to the csv string, for example:

22;Number of bedrooms not stated;213/2.1
23;Average number of bedrooms per dwelling;2.3/1.8
24;Average number of people per household;2.7/1.8
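As a rough check of why the "Num" column matters, the sketch below uses a shortened, hypothetical sample in which one attribute repeats inside the "Bayside (C)" block. It is only an illustration under that assumption: it shows that unstack fails on the duplicated (Attribute, Region) pair and succeeds once the counter is part of the index. The names `failed` and `wide` are made up for this example.

```python
import io
import pandas as pd

# Hypothetical, shortened sample: the last row repeats an attribute
# within the "Bayside (C)" block.
csv = """;0;1
0;Region;Banyule (C)
3;1 bedroom;1287/2.9
7;Number of bedrooms not stated;645/1.5
11;Region;Bayside (C)
14;1 bedroom;1054/3.2
18;Number of bedrooms not stated;419/1.3
22;Number of bedrooms not stated;213/2.1"""

df = pd.read_csv(io.StringIO(csv), index_col=0, sep=";")
df.columns = ["Attribute", "Value"]
df["Region"] = df["Value"].where(df["Attribute"] == "Region").ffill()
df = df[df["Attribute"] != "Region"].copy()

# Without Num, the (Attribute, Region) pairs are not unique,
# so unstack cannot reshape and raises a ValueError.
try:
    df.set_index(["Attribute", "Region"])["Value"].unstack()
    failed = False
except ValueError:
    failed = True

# With Num, every (Attribute, Num, Region) combination is unique.
df["Num"] = df.groupby(["Attribute", "Region"]).cumcount() + 1
wide = df.set_index(["Attribute", "Num", "Region"])["Value"].unstack("Region")
print(failed)
print(wide)
```

The repeated attribute ends up on its own row with Num equal to 2, and the region that has no second occurrence gets a missing value in that row.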