Pandas: adding columns to filter messes up data structure

Posted: 2015-08-03 09:05:13

Tags: python pandas

Consider the .dta file behind this zip file.

Here is the first row:

>>> import pandas as pd
>>> df = pd.read_stata('cepr_org_2014.dta', convert_categoricals = False)
>>> df.iloc[0]
year                   2014
month                     1
minsamp                   8
hhid        000936071123039
hhid2                 91001
# [...]
>>> df.iloc[0]['wage4']
nan

I double-checked with Stata and this looks correct. So far so good. Now I specify the columns I want to keep and redo the exercise.

>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
...     convert_categoricals = False,
...     columns=columns+columns2)
>>> df.iloc[0]
wbho                       1
age                       65
female                     0
wage4       1.7014118346e+38
ind_nber                 101
year                    2014
month                      1
minsamp                    8
hhid                     NaN
hhid2                    NaN
fnlwgt              560.1073
Name: 0, dtype: object

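After adding the list of columns to keep, pandas

  • no longer understands missing values: wage4 is a huge number instead of NaN
  • creates missing values for hhid and hhid2.

Why?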

Footnote: loading the whole dataset first and then filtering with df[columns+columns2] works.
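A minimal sketch of that workaround, for anyone who needs it in the meantime. The helper name read_stata_then_select is hypothetical (it is not part of pandas); it only relies on pd.read_stata and ordinary column selection, and it reads every column into memory before dropping the unwanted ones.

```python
import pandas as pd


def read_stata_then_select(path, columns, **kwargs):
    """Read every column of a Stata file, then keep `columns` in that order.

    Hypothetical helper: it avoids passing `columns=` to pd.read_stata and
    selects on the fully loaded DataFrame instead.
    """
    df = pd.read_stata(path, **kwargs)
    return df[list(columns)]


# Usage with the names from the question (requires the .dta file on disk):
# df = read_stata_then_select('cepr_org_2014.dta', columns + columns2,
#                             convert_categoricals=False)
```

The cost is memory and load time, since the full file is parsed even if only a few columns are needed.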

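2 answers:

Answer 0 (score: 1)

I traced this error to a bug in pandas. I fixed the bug in https://github.com/jbuyl/pandas/tree/fix-column-dtype-mixing and have opened a pull request to get the fix merged, but feel free to check out my fork/branch in the meantime.

Here is the result of running the example with the fix applied: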

>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
...     convert_categoricals = False,
...     columns=columns+columns2)
>>> df.iloc[0]
wbho                      1
age                      65
female                    0
wage4                   nan
ind_nber                NaN
year                   2014
month                     1
minsamp                   8
hhid        000936071123039
hhid2                 91001
fnlwgt              560.107
Name: 0, dtype: object
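
For reference, the 1.7014118346e+38 that showed up for wage4 in the question equals 2**127. As far as I know, that is the value Stata uses to encode the system missing value "." in a 4-byte float, so it looks like the raw missing-value code was passed through instead of being converted to NaN. A quick check with the standard library only (the byte pattern is my assumption about the file format, not something taken from the question):

```python
import struct

# Assumption: Stata encodes the float missing value "." as the bit pattern
# 0x7f000000, i.e. the little-endian bytes 00 00 00 7f.
STATA_FLOAT_MISSING_BYTES = b"\x00\x00\x00\x7f"

# Decode those four bytes as a little-endian IEEE 754 single-precision float.
raw_missing = struct.unpack("<f", STATA_FLOAT_MISSING_BYTES)[0]

print(raw_missing == 2.0 ** 127)   # compare against 2**127
print(f"{raw_missing:.10e}")       # same formatting width as in the question
```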

Answer 1 (score: 1)

This appears to be a bug in the source file pandas/io/stata.py, in the method _do_select_columns(). The loop:

dtyplist = []
typlist = []
fmtlist = []
lbllist = []
matched = set()
for i, col in enumerate(data.columns):
    if col in column_set:
        matched.update([col])
        dtyplist.append(self.dtyplist[i])
        typlist.append(self.typlist[i])
        fmtlist.append(self.fmtlist[i])
        lbllist.append(self.lbllist[i])

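messes up the order of the dtypes: it walks the columns in file order, so the collected type information does not match the order in which the columns were requested.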
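
To make the mismatch concrete, here is a toy model in plain Python. The column names are borrowed from the question, but the lists and dtypes are illustrative assumptions and not the actual pandas internals: type information is collected in file order and then paired by position with columns that come back in the requested order.

```python
# Toy model only: illustrative names and dtypes, not pandas internals.
file_columns = ["year", "hhid", "wage4", "age"]        # order in the file
file_dtypes = ["int16", "object", "float32", "int8"]   # aligned with file_columns
requested = ["wage4", "age", "year", "hhid"]           # order asked for by the caller

requested_set = set(requested)

# Collect the dtypes by walking the file order, as the loop above does.
collected = [
    file_dtypes[i]
    for i, col in enumerate(file_columns)
    if col in requested_set
]

# The selected data come back in the requested order, so pairing the
# collected dtypes with the requested names by position goes wrong.
paired = dict(zip(requested, collected))
expected = dict(zip(file_columns, file_dtypes))
mismatched = [col for col in requested if paired[col] != expected[col]]

print(paired)
print(mismatched)
```

In this toy case every requested column ends up with another column's dtype, which is the same kind of shuffle visible in the dtypes of the real data.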

Compare the dtypes of df2 and df3 in this example:

In [1]:

import zipfile
import pandas as pd
z = zipfile.ZipFile('/Users/q6600sl/Downloads/cepr_org_2014.zip')
df = pd.read_stata(z.open('cepr_org_2014.dta'), convert_categoricals = False)
In [2]:

columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
In [3]:

df2 = pd.read_stata(z.open('cepr_org_2014.dta'),
                    convert_categoricals = False,
                    columns=columns+columns2)
In [4]:

df2.dtypes
Out[4]:
wbho          int16
age            int8
female         int8
wage4        object
ind_nber     object
year        float32
month          int8
minsamp        int8
hhid        float64
hhid2       float64
fnlwgt      float32
dtype: object
In [5]:

df3 = df[columns+columns2]
In [6]:

df3.dtypes
Out[6]:
wbho           int8
age            int8
female         int8
wage4       float32
ind_nber    float64
year          int16
month          int8
minsamp        int8
hhid         object
hhid2        object
fnlwgt      float32
dtype: object

Changing the loop to:

dtyplist = []
typlist = []
fmtlist = []
lbllist = []
#matched = set()
for i in np.hstack([np.argwhere(data.columns==col) for col in columns]).ravel():
    # if col in column_set:
    #     matched.update([col])
    dtyplist.append(self.dtyplist[i])
    typlist.append(self.typlist[i])
    fmtlist.append(self.fmtlist[i])
    lbllist.append(self.lbllist[i])

fixes the problem.

(I am not sure what matched is doing here. It does not seem to be used anywhere afterwards.)
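
The idea behind the change, looking up each requested column's position and collecting the metadata in that order, can also be sketched without NumPy. This is only an illustration under my own assumptions: select_in_requested_order is a hypothetical name, and the lists stand in for the reader's per-column type lists rather than reproducing pandas' code.

```python
def select_in_requested_order(file_columns, file_meta, requested):
    """Return the entries of `file_meta` for `requested`, in requested order.

    Hypothetical helper: `file_meta` is any per-column list aligned with
    `file_columns` (for example dtypes or formats).
    """
    # Map each column name to its position in the file.
    position = {col: i for i, col in enumerate(file_columns)}

    unknown = [col for col in requested if col not in position]
    if unknown:
        raise ValueError(f"columns not found in file: {unknown}")

    # Iterate over the requested names, not over the file order.
    return [file_meta[position[col]] for col in requested]


file_columns = ["year", "hhid", "wage4", "age"]
file_dtypes = ["int16", "object", "float32", "int8"]
requested = ["wage4", "age", "year", "hhid"]

selected = select_in_requested_order(file_columns, file_dtypes, requested)
print(dict(zip(requested, selected)))
```

Because the loop runs over the requested names, the collected metadata line up with the columns however the caller orders them.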