我试图在函数内部将一列与另一列分开:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
可以看出,我在DataFrame中除以一列,但我得到一个相当奇怪的错误:
ValueError: putmask: mask and data must be the same size
我必须承认,这是我第一次看到这个错误。它似乎表明DF和列具有不同的长度,但显然(因为列来自DataFrame)它们不是。
进一步的改进是使用此函数循环数据管理过程而不是特定年份的集合(数据来自Quarterly Census of Employment and Wages中的beta series'单个文件')。与1990-2000时期相关的集合顺利出现,但是2001年引发了这个错误。我恐怕多年来一直无法确定结构上的差异,即使我可以,它如何解释长度不匹配?
任何想法都会非常感激。
编辑(2/1/2014):谢谢你看看汤姆。根据要求,pandas版本为0.13.0
,并且有问题的数据文件位于BLS FTP站点上的here。为了澄清一致结构的含义,每年都有相同的变量集和dtype(除了一致的数据代码结构)。
编辑(2014年2月1日):也许分享整个功能会很有用:
def qcew(f,m_dict):
'''Function reads in file and captures county level aggregations with government contributions'''
#Read in file
cew=pd.read_csv(f)
#Create string version of area fips
cew['fips']=cew['area_fips'].astype(str)
#Generate description variables
cew['area']=cew['fips'].map(m_dict['area'])
cew['industry']=cew['industry_code'].map(m_dict['industry'])
cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
cew['own']=cew['own_code'].map(m_dict['ownership'])
cew['size']=cew['size_code'].map(m_dict['size'])
#Generate boolean masks
lagg_mask=cew['agglvl_code']==73
lsize_mask=cew['size_code']==0
#Subset data to above specifications
cew_super=cew[lagg_mask & lsize_mask]
#Define column subset
lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
'total_annual_wages','own_code']
#Subset to desired columns
cew_sub=cew_super[lsub_cols]
#Rename columns
cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
#Set index
cew_sub.set_index(['year','fips','cty'],inplace=True)
#Capture total wage base and the contributions of Federal, State, and Local
cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
#Convert to DFs for join
lbase=DataFrame(cew_base).rename(columns={0:'base'})
lfed=DataFrame(cew_fed).rename(columns={0:'fed_wage'})
lstate=DataFrame(cew_st).rename(columns={0:'st_wage'})
llocal=DataFrame(cew_loc).rename(columns={0:'loc_wage'})
#Join these series
lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
#Diag prints
print f
print lcontrib_lev.head()
print lcontrib_lev.describe()
print '*****************************\n'
#Calculate proportional contributions (failure point)
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
#Group base data by year, county, and industry
cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
#Join contributions to joined data
cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
答案 0 :(得分:1)
对我来说工作正常(这是在0.13.1上,但IIRC我认为这个特定领域的任何内容都没有改变,但它可能是一个修复过的错误。)
In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]:
base fed_wage st_wage loc_wage
year fips cty
2001 1000 1000 NaN NaN NaN NaN
1000 NaN NaN NaN NaN
10000 10000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
10001 10001 NaN NaN NaN NaN
[5 rows x 4 columns]
In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]:
base fed_wage st_wage loc_wage
year fips cty
2001 CS566 CS566 1 0.000000 0.000000 0.000000
US000 US000 1 0.022673 0.027978 0.073828
USCMS USCMS 1 0.000000 0.000000 0.000000
USMSA USMSA 1 0.000000 0.000000 0.000000
USNMS USNMS 1 0.000000 0.000000 0.000000
[5 rows x 4 columns]