pandas column division ValueError (putmask: mask and data must be the same size)

Asked: 2014-02-02 17:23:15

Tags: python pandas division dataframe

I'm attempting to divide one column by another inside of a function:

lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')

As you can see, I'm dividing the DataFrame by one of its own columns, but I get a rather strange error:

ValueError: putmask: mask and data must be the same size

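I must admit this is the first time I've seen this error. It seems to suggest that the DataFrame and the column have different lengths, but clearly (since the column comes from the DataFrame) they do not.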
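For reference, here is a minimal sketch of the same call on a small frame with a unique index. The column names mirror the ones in the question, but the values and the `fips` labels are made up for illustration. It only shows what `div(..., axis='index')` is expected to do when the index is clean, and it adds two cheap sanity checks (equal length, unique index) that can be run on the real data.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the question.
df = pd.DataFrame(
    {
        "base": [10.0, 20.0, 40.0],
        "fed_wage": [1.0, 5.0, 10.0],
        "st_wage": [2.0, 5.0, 20.0],
    },
    index=pd.Index(["a", "b", "c"], name="fips"),
)

# Divide every column by the 'base' column, aligning row by row on the index.
shares = df.div(df["base"], axis="index")
print(shares)

# Sanity checks worth running on the real frame before dividing.
print(len(df) == len(df["base"]))  # lengths match, as the question expects
print(df.index.is_unique)          # row alignment relies on the index labels
```

If the second check prints `False` on the real frame, the index itself is a more likely suspect than the column lengths.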

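As a further refinement, I'm using this function to loop a data management process over a collection of year-specific files (the data come from the "beta series" single files of the Quarterly Census of Employment and Wages). The files for the 1990-2000 period go through without a hitch, but 2001 raises this error. I'm afraid I have been unable to find a difference in structure across the years, and even if I could, how would it explain a length mismatch?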

Any thoughts would be greatly appreciated.

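EDIT (2/1/2014): Thanks for taking a look, Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. To clarify what I mean by consistent structure: every year has the same set of variables and dtypes (in addition to a consistent data code structure).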
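One way to check the "same variables and dtypes" claim mechanically is to compare the `dtypes` of two yearly frames. The sketch below is only an illustration: `structure_diff` is a hypothetical helper, and the two toy frames stand in for files that would really come from `pd.read_csv`. The dtype difference shown is invented and is not a finding about the actual QCEW files.

```python
import pandas as pd

def structure_diff(a, b):
    """Return the columns whose presence or dtype differs between two frames."""
    da, db = a.dtypes.astype(str), b.dtypes.astype(str)
    both = pd.concat([da, db], axis=1, keys=["a", "b"])
    # A column missing from one frame shows up as NaN and also counts as different.
    return both[both["a"].ne(both["b"])]

# Hypothetical stand-ins for two yearly files.
y2000 = pd.DataFrame({"area_fips": [1000, 1001], "total_annual_wages": [5, 7]})
y2001 = pd.DataFrame({"area_fips": ["1000", "US000"], "total_annual_wages": [6, 8]})

diff = structure_diff(y2000, y2001)
print(diff)
```

An empty result means the two frames agree on column names and dtypes; it says nothing about the values themselves.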

EDIT (2/1/2014): Perhaps sharing the entire function would be useful:

def qcew(f,m_dict):
    '''Function reads in file and captures county level aggregations with government contributions'''
    #Read in file
    cew=pd.read_csv(f)

    #Create string version of area fips
    cew['fips']=cew['area_fips'].astype(str)

    #Generate description variables
    cew['area']=cew['fips'].map(m_dict['area'])
    cew['industry']=cew['industry_code'].map(m_dict['industry'])
    cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
    cew['own']=cew['own_code'].map(m_dict['ownership'])
    cew['size']=cew['size_code'].map(m_dict['size'])

    #Generate boolean masks
    lagg_mask=cew['agglvl_code']==73
    lsize_mask=cew['size_code']==0

    #Subset data to above specifications
    cew_super=cew[lagg_mask & lsize_mask]

    #Define column subset
    lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
              'total_annual_wages','own_code']

    #Subset to desired columns
    cew_sub=cew_super[lsub_cols]

    #Rename columns
    cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']

    #Set index
    cew_sub.set_index(['year','fips','cty'],inplace=True)

    #Capture total wage base and the contributions of Federal, State, and Local
    cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()

    #Convert to DFs for join
    lbase=DataFrame(cew_base).rename(columns={0:'base'})
    lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})

    #Join these series
    lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)

    #Diag prints
    print f
    print lcontrib_lev.head()
    print lcontrib_lev.describe()
    print '*****************************\n'

    #Calculate proportional contributions (failure point)
    lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')

    #Group base data by year, county, and industry
    cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()

    #Join contributions to joined data
    cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])

    return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]

1 Answer:

Answer 0 (score: 1)

Works fine for me (this is on 0.13.1; IIRC nothing changed in this particular area, but it may have been a bug that has since been fixed).

In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]: 
                  base  fed_wage  st_wage  loc_wage
year fips  cty                                     
2001 1000  1000    NaN       NaN      NaN       NaN
           1000    NaN       NaN      NaN       NaN
     10000 10000   NaN       NaN      NaN       NaN
           10000   NaN       NaN      NaN       NaN
     10001 10001   NaN       NaN      NaN       NaN

[5 rows x 4 columns]

In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]: 
                  base  fed_wage   st_wage  loc_wage
year fips  cty                                      
2001 CS566 CS566     1  0.000000  0.000000  0.000000
     US000 US000     1  0.022673  0.027978  0.073828
     USCMS USCMS     1  0.000000  0.000000  0.000000
     USMSA USMSA     1  0.000000  0.000000  0.000000
     USNMS USNMS     1  0.000000  0.000000  0.000000

[5 rows x 4 columns]
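The `head()` output above lists each index key twice, for example (2001, 1000, 1000), with NaN shares, which suggests the index of `lcontrib_lev` is not unique. One possible cause, which is an assumption and is not confirmed anywhere in this thread, is that `pd.concat(..., axis='index')` stacks the four frames row-wise instead of placing them side by side. A minimal sketch with made-up values shows the difference between the two axes:

```python
import pandas as pd

# Toy frames (hypothetical values) sharing the same index, as in the question.
idx = pd.Index(["1000", "10001"], name="fips")
base = pd.DataFrame({"base": [100.0, 200.0]}, index=idx)
fed = pd.DataFrame({"fed_wage": [10.0, 50.0]}, index=idx)

# axis=0 stacks rows: every key appears once per input frame.
stacked = pd.concat([base, fed], axis=0).fillna(0)
print(stacked.shape, stacked.index.is_unique)

# axis=1 aligns on the index and puts the columns side by side.
joined = pd.concat([base, fed], axis=1).fillna(0)
print(joined.shape, joined.index.is_unique)

# With one row per key, the proportional contributions are well defined.
shares = joined.div(joined["base"], axis="index")
print(shares)
```

Checking `lcontrib_lev.index.is_unique` on the real data would confirm or rule out this reading.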