熊猫交叉表和规范化索引

时间:2020-05-19 11:20:29

标签: python-3.x pandas crosstab

您如何规范具有多索引的熊猫交叉表?

假设您有这样的df:

# RANDOM DATA
np.random.seed(2)

Year = [2020,2019,2018,2017]*1000
col1 = ['A','B','C','D']*1000
col2 = np.random.randint(0,2,4000)
weight = np.random.randint(1,3,4000)

random.shuffle(Year)
random.shuffle(col1)
random.shuffle(col2)
random.shuffle(weight)

column_names = ['Year', 'weight', 'col1', 'col2']

df = pd.DataFrame(columns=column_names)

df['Year'] = Year
df['col1'] = col1
df['col2'] = col2
df['weight'] = weight

现在您可以做一个交叉表:

pd.crosstab(index=[df['Year']],
            columns=[df['col1'], df['col2']],
            values=df['weight'],
            aggfunc=sum)

col1         A                   B                   C                   D  \
col2         0         1         0         1         0         1         0   
Year                                                                         
2017  0.117962  0.128686  0.128016  0.130697  0.137399  0.122654  0.115282   
2018  0.116832  0.111551  0.120132  0.118152  0.138614  0.125413  0.131353   
2019  0.137584  0.126846  0.127517  0.108725  0.114765  0.138255  0.114765   
2020  0.116356  0.134309  0.113032  0.143617  0.121676  0.118351  0.121676   

col1            
col2         1  
Year            
2017  0.119303  
2018  0.137954  
2019  0.131544  
2020  0.130984  

您如何在多索引中进行标准化?

我的预期输出是:

col1    A   A   B   B   C   C   D   D
col2    0   1   0   1   0   1   0   1
Year                                
2017    0.478   0.522   0.495   0.505   0.528   0.472   0.491   0.509
2018    0.512   0.488   0.504   0.496   0.525   0.475   0.488   0.512
2019    0.520   0.480   0.540   0.460   0.454   0.546   0.466   0.534
2020    0.464   0.536   0.440   0.560   0.507   0.493   0.482   0.518

1 个答案:

答案 0 :(得分:0)

我认为您可以将groupby().value_counts用于规范化:

#### Random data ####
np.random.seed(2)
Year = [2020,2019,2018,2017]*1000
col1 = np.repeat(['A','B','C','D'],1000)
col2 = np.random.randint(0,2,4000)

column_names = ['Year', 'col1', 'col2']

df = pd.DataFrame(columns=column_names)

df['Year'] = Year
df['col1'] = col1
df['col2'] = col2
#### End random data ####


(df.groupby(['Year','col1'])['col2']
   .value_counts(normalize=True)
   .unstack(['col1','col2'])
)

输出:

col1      A             B             C             D       
col2      0      1      1      0      0      1      1      0
Year                                                        
2017  0.524  0.476  0.528  0.472  0.540  0.460  0.512  0.488
2018  0.496  0.504  0.484  0.516  0.560  0.440  0.524  0.476
2019  0.496  0.504  0.552  0.448  0.484  0.516  0.508  0.492
2020  0.500  0.500  0.540  0.460  0.436  0.564  0.484  0.516

更新:对于加权的value_counts

new_df = df.groupby(['Year','col1','col2'])['weight'].sum()
new_df /= new_df.sum(level=['Year','col1'])
new_df = new_df.unstack(['col1','col2'])

输出:

col1         A                   B                   C                   D  \
col2         0         1         0         1         0         1         0   
Year                                                                         
2017  0.441176  0.558824  0.542500  0.457500  0.474286  0.525714  0.482857   
2018  0.500000  0.500000  0.474114  0.525886  0.481268  0.518732  0.489461   
2019  0.521622  0.478378  0.474490  0.525510  0.498801  0.501199  0.531343   
2020  0.493671  0.506329  0.479339  0.520661  0.515789  0.484211  0.513021   

col1            
col2         1  
Year            
2017  0.517143  
2018  0.510539  
2019  0.468657  
2020  0.486979