Consider the .dta file behind this zip file.
Here is the first row:
>>> df = pd.read_stata('cepr_org_2014.dta', convert_categoricals = False)
>>> df.iloc[0]
year 2014
month 1
minsamp 8
hhid 000936071123039
hhid2 91001
# [...]
>>> df.iloc[0]['wage4']
nan
I double-checked in Stata and this looks correct. So far, so good. Now I set up a list of columns I want to keep and redo the exercise.
>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
...                    convert_categoricals = False,
...                    columns=columns+columns2)
>>> df.iloc[0]
wbho 1
age 65
female 0
wage4 1.7014118346e+38
ind_nber 101
year 2014
month 1
minsamp 8
hhid NaN
hhid2 NaN
fnlwgt 560.1073
Name: 0, dtype: object
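After adding the list of columns to keep, pandas turns wage4 into a huge number instead of NaN, and hhid and hhid2 come back as missing values. Why?
Footnote: loading the whole dataset first and then filtering with df[columns+columns2] works.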
Answer 0 (score: 1)
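I traced this problem to a bug in pandas. I fixed the bug in https://github.com/jbuyl/pandas/tree/fix-column-dtype-mixing and have opened a pull request to get the fix merged, but feel free to check out my fork/branch in the meantime.
Here is the result of running the example with the fix: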
>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
... convert_categoricals = False,
... columns=columns+columns2)
>>> df.iloc[0]
wbho 1
age 65
female 0
wage4 nan
ind_nber NaN
year 2014
month 1
minsamp 8
hhid 000936071123039
hhid2 91001
fnlwgt 560.107
Name: 0, dtype: object
Answer 1 (score: 1)
This looks like a bug in the pandas source. In the _do_select_columns() method in pandas/io/stata.py, the loop:
dtyplist = []
typlist = []
fmtlist = []
lbllist = []
matched = set()
for i, col in enumerate(data.columns):
    if col in column_set:
        matched.update([col])
        dtyplist.append(self.dtyplist[i])
        typlist.append(self.typlist[i])
        fmtlist.append(self.fmtlist[i])
        lbllist.append(self.lbllist[i])
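messes up the order of the dtypes: they are appended in the order of data.columns, which does not match the order in which the requested columns appear.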
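To make the mismatch concrete, here is a minimal pure-Python sketch of the pattern. It is not the actual pandas code: the column names come from the question, but the file order, the type labels, and the helper filter_in_file_order are assumptions made up for illustration. The metadata is filtered in file order and then paired with the columns in the requested order, which is how a column can end up with another column's type:

```python
# Hypothetical file layout: column names and their types, index-aligned.
file_columns = ["year", "hhid", "age", "wage4"]
file_types = ["int16", "str", "int8", "float32"]

# The caller asks for the same columns, but in a different order.
requested = ["age", "wage4", "year", "hhid"]


def filter_in_file_order(file_columns, file_types, requested):
    """Keep the types of the requested columns, walking the FILE order."""
    wanted = set(requested)
    return [typ for col, typ in zip(file_columns, file_types) if col in wanted]


kept_types = filter_in_file_order(file_columns, file_types, requested)

# The data is later returned in the requested order, so pairing it
# positionally with kept_types attaches the wrong type to each column.
mislabeled = dict(zip(requested, kept_types))
true_types = dict(zip(file_columns, file_types))
wrong = [col for col in requested if mislabeled[col] != true_types[col]]

for col in requested:
    print(col, "treated as", mislabeled[col], "but stored as", true_types[col])
```

In this toy layout the string column hhid is treated as a float and the float column wage4 is treated as a string, which is the same kind of swap that shows up in the dtypes comparison.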
Compare the dtypes of df2 and df3 in this example:
In [1]:
import zipfile
import pandas as pd
z = zipfile.ZipFile('/Users/q6600sl/Downloads/cepr_org_2014.zip')
df = pd.read_stata(z.open('cepr_org_2014.dta'), convert_categoricals = False)
In [2]:
columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
In [3]:
df2 = pd.read_stata(z.open('cepr_org_2014.dta'),
convert_categoricals = False,
columns=columns+columns2)
In [4]:
df2.dtypes
Out[4]:
wbho int16
age int8
female int8
wage4 object
ind_nber object
year float32
month int8
minsamp int8
hhid float64
hhid2 float64
fnlwgt float32
dtype: object
In [5]:
df3 = df[columns+columns2]
In [6]:
df3.dtypes
Out[6]:
wbho int8
age int8
female int8
wage4 float32
ind_nber float64
year int16
month int8
minsamp int8
hhid object
hhid2 object
fnlwgt float32
dtype: object
Changing the loop to:
dtyplist = []
typlist = []
fmtlist = []
lbllist = []
#matched = set()
for i in np.hstack([np.argwhere(data.columns==col) for col in columns]).ravel():
    # if col in column_set:
    #     matched.update([col])
    dtyplist.append(self.dtyplist[i])
    typlist.append(self.typlist[i])
    fmtlist.append(self.fmtlist[i])
    lbllist.append(self.lbllist[i])
fixes the problem.
(I don't know what matched was doing here. It never seems to be used afterwards.)
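The same indexing idea can be sketched without NumPy: look up each requested column's position in the file, then read the metadata in that order. The helper positions_in_requested_order and the sample lists below are hypothetical and chosen only to illustrate the technique; none of this is taken from pandas itself.

```python
def positions_in_requested_order(file_columns, requested):
    """Return the file position of each requested column, in REQUESTED order."""
    position_of = {col: i for i, col in enumerate(file_columns)}
    missing = [col for col in requested if col not in position_of]
    if missing:
        raise ValueError("columns not found in file: %s" % missing)
    return [position_of[col] for col in requested]


# Hypothetical file layout: column names and their types, index-aligned.
file_columns = ["year", "hhid", "age", "wage4"]
file_types = ["int16", "str", "int8", "float32"]
requested = ["age", "wage4", "year", "hhid"]

positions = positions_in_requested_order(file_columns, requested)

# Metadata picked through these positions lines up with the requested order.
selected_types = [file_types[i] for i in positions]
for col, typ in zip(requested, selected_types):
    print(col, typ)
```

Building the name-to-position dictionary once avoids scanning the column list for every requested name, and the explicit check reports unknown columns instead of failing later with a less helpful error.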