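So basically, I'm trying to fix an Excel spreadsheet that was copy-pasted from a pivot table.
I had to do some preprocessing to get rid of the NaN values. The dataset looks like this: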
0 1
0 Region Banyule (C)
2 None (includes bedsitters) 78/0.2
3 1 bedroom 1287/2.9
4 2 bedrooms 8457/19.4
5 3 bedrooms 21865/50
6 4 or more bedrooms 11366/26
7 Number of bedrooms not stated 645/1.5
9 Average number of bedrooms per dwelling 3.1/--
10 Average number of people per household 2.6/--
11 Region Bayside (C)
13 None (includes bedsitters) 97/0.3
14 1 bedroom 1054/3.2
15 2 bedrooms 7939/23.9
16 3 bedrooms 13731/41.3
17 4 or more bedrooms 10031/30.1
18 Number of bedrooms not stated 419/1.3
20 Average number of bedrooms per dwelling 3.1/--
21 Average number of people per household 2.6/--
I transposed it here:
tr=r_2011.T
What I get is:
Region Average number of people per household Region Average number of people per household
Banyule (C) 2.7/-- Bayside(C) 2.6/--
However, I want the dataset arranged in this structure:
Region None (includes bedsitters) 1 bedroom 2 bedrooms 3 bedrooms 4 or more bedrooms
Banyule (C) 78/0.2 1287/2.9 8457/19.4 21865/50 11366/26
Bayside (C) 97/0.3 1054/3.2 7939/23.9 13731/41.3 10031/30.1
I can't work out whether pivot or melt would solve this.
Here is the link to the file: https://drive.google.com/open?id=18p0qPiqOaPF1d8NgVVB_qIYNV_HbtXQo
Answer 0 (score: 0)
You can use the following code:
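# create an auxiliary column "Region" from the rows labelled
# "Region" and forward fill it for all rows
# (assigning the result back avoids ffill(inplace=True) on a column
# selection, which newer pandas versions may not apply to df)
df['Region'] = df['1'].where(df['0'] == 'Region').ffill()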
# add the original label and the region to the index
# and unstack it to align the attributes of the regions
df.set_index(['0', 'Region'], inplace=True)
df.unstack()
The output is:
Region Banyule (C) Bayside (C)
0
1 bedroom 1287/2.9 1054/3.2
2 bedrooms 8457/19.4 7939/23.9
3 bedrooms 21865/50 13731/41.3
4 or more bedrooms 11366/26 10031/30.1
Average number of bedrooms per dwelling 3.1/-- 3.1/--
Average number of people per household 2.6/-- 2.6/--
None (includes bedsitters) 78/0.2 97/0.3
Number of bedrooms not stated 645/1.5 419/1.3
Region Banyule (C) Bayside (C)
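The question asks for the opposite orientation, with one row per region and one column per attribute. A minimal sketch of one way to get there, assuming the same column names '0' and '1' as above and a shortened excerpt of the sample data, is to forward fill the region and then pivot. The names wide and order are only illustrative:

```python
import io
import pandas as pd

# Shortened two-region excerpt of the sample data from the question
csv = """;0;1
0;Region;Banyule (C)
2;None (includes bedsitters);78/0.2
3;1 bedroom;1287/2.9
4;2 bedrooms;8457/19.4
11;Region;Bayside (C)
13;None (includes bedsitters);97/0.3
14;1 bedroom;1054/3.2
15;2 bedrooms;7939/23.9"""

df = pd.read_csv(io.StringIO(csv), index_col=0, sep=';')

# Forward fill the region name onto every row that belongs to it
df['Region'] = df['1'].where(df['0'] == 'Region').ffill()

# Drop the "Region" marker rows, then pivot:
# one row per region, one column per attribute
attributes = df[df['0'] != 'Region']
wide = attributes.pivot(index='Region', columns='0', values='1')

# pivot sorts the columns; restore the order of first appearance
order = attributes['0'].drop_duplicates().tolist()
wide = wide[order]
wide.columns.name = None
print(wide)
```

Like the unstack approach, pivot requires each (Region, attribute) pair to occur only once, otherwise it raises a ValueError.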
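I built the DataFrame I used from the output in the question, as follows. Column 0 holds labels such as "Region", and column 1 holds the corresponding values such as "Banyule (C)":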
csv=""";0;1
0;Region;Banyule (C)
2;None (includes bedsitters);78/0.2
3;1 bedroom;1287/2.9
4;2 bedrooms;8457/19.4
5;3 bedrooms;21865/50
6;4 or more bedrooms;11366/26
7;Number of bedrooms not stated;645/1.5
9;Average number of bedrooms per dwelling;3.1/--
10;Average number of people per household;2.6/--
11;Region;Bayside (C)
13;None (includes bedsitters);97/0.3
14;1 bedroom;1054/3.2
15;2 bedrooms;7939/23.9
16;3 bedrooms;13731/41.3
17;4 or more bedrooms;10031/30.1
18;Number of bedrooms not stated;419/1.3
20;Average number of bedrooms per dwelling;3.1/--
21;Average number of people per household;2.6/--"""
import io
import pandas as pd
sb=io.StringIO(csv)
df= pd.read_csv(sb, index_col=0, sep=';')
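The code above assumes that your data is properly pre-aggregated and that the resulting index is unique. If it is not unique, you can add an auxiliary "Num" column to make it unique. That looks like this: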
sb=io.StringIO(csv)
df= pd.read_csv(sb, index_col=0, sep=';')
# just rename the columns to have meaningful names
df.columns= pd.Index(['Attribute', 'Value'])
# add the region info in a separate column
# (assign the result back instead of calling ffill(inplace=True)
# on a column selection)
df['Region'] = df['Value'].where(df['Attribute'] == 'Region').ffill()
# now create an auxiliary Num column that allows us to create
# a unique index based on Attribute, Region and Num
df['Num']= df.groupby(['Attribute', 'Region']).cumcount()+1
# filter out the row with the Region
df=df[df['Attribute'] != 'Region']
# set the index for unstack, that is the
# columns in the final index + the index used for the
# pivot like unstack operation
df.set_index(['Attribute', 'Num', 'Region'], inplace=True)
df.unstack(['Region'])
To test it, you can add a few rows to the csv string, for example:
22;Number of bedrooms not stated;213/2.1
23;Average number of bedrooms per dwelling;2.3/1.8
24;Average number of people per household;2.7/1.8
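As a rough, self-contained check of the duplicate case, the sketch below uses a shortened excerpt of the sample data with one repeated attribute appended (row 22). It selects the Value column before unstacking so that the result has plain region columns, which is a small deviation from the code above, and the name result is only illustrative:

```python
import io
import pandas as pd

# Shortened excerpt of the sample data, plus one repeated attribute (row 22)
csv = """;0;1
0;Region;Banyule (C)
7;Number of bedrooms not stated;645/1.5
10;Average number of people per household;2.6/--
11;Region;Bayside (C)
18;Number of bedrooms not stated;419/1.3
21;Average number of people per household;2.6/--
22;Number of bedrooms not stated;213/2.1"""

df = pd.read_csv(io.StringIO(csv), index_col=0, sep=';')
df.columns = ['Attribute', 'Value']
df['Region'] = df['Value'].where(df['Attribute'] == 'Region').ffill()

# Number repeated (Attribute, Region) pairs 1, 2, ... so the index stays unique
df['Num'] = df.groupby(['Attribute', 'Region']).cumcount() + 1
df = df[df['Attribute'] != 'Region']

# Unstack only the Value column so the columns are the region names
result = df.set_index(['Attribute', 'Num', 'Region'])['Value'].unstack('Region')
print(result)
```

The repeated attribute ends up in its own row with Num equal to 2, and the region that has no second value is left as NaN there.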