Pandas+Numpy: filtering DF columns by numerical dtype gives unexpected behavior

Time: 2016-03-02 11:03:20

Tags: python numpy pandas

I have the following table (a small sample, for anyone who wants to try this):

serial;spectra;name;UKST;ra;dec;ra2000;dec2000;BJG;BJSEL;BJG_OLD;BJSELOLD;GALEXT;SB_BJ;SR_R;z;z_helio;obsrun;quality;abemma;Z_ABS;KBESTR;R_CRCOR;Z_EMI;NMBEST;SNR;ETA_TYPE
1;2;TGS436Z001;349;00:11:55.72;-32:32:55.2;00:14:27.05;-32:16:14.6;19.424;19.362;19.430;19.390;0.062;19.368;18.286;0.2981;0.2981;01SEP;4;1;0.2981;5;4.5700;0.2984;1;3.8;-99.90000
2;1;TGS496Z001;349;00:11:59.29;-33:14:41.3;00:14:30.55;-32:58:00.7;18.842;18.789;18.870;18.840;0.053;18.688;17.291;0.1229;0.1228;01OCT;5;1;0.1229;1;14.3800;-9.9990;0;47.6;-2.58920
3;1;TGS435Z001;349;00:11:49.37;-32:39:57.4;00:14:20.71;-32:23:16.8;18.320;18.265;18.350;18.310;0.055;18.336;17.138;0.1038;0.1038;01SEP;4;1;0.1038;1;9.3800;0.1032;1;28.4;-2.46500

Side note:
You can build a pandas DataFrame from the sample above (pasted in place of '''data-sample''') as follows (Python 2 syntax):
>>> import StringIO
>>> _tmp = StringIO.StringIO()
>>> _tmp.write('''data-sample''')
>>> _tmp.seek(0)
>>> import pandas
>>> df = pandas.read_csv(_tmp,delimiter=';')
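
For readers on Python 3, where the StringIO module no longer exists, a roughly equivalent way to load the sample might look like the sketch below. It assumes io.StringIO from the standard library and the sep argument of pandas.read_csv; the variable name data_sample is just an illustrative choice.

```python
import io

import pandas as pd

# The three sample rows from the table above, one record per line.
data_sample = """serial;spectra;name;UKST;ra;dec;ra2000;dec2000;BJG;BJSEL;BJG_OLD;BJSELOLD;GALEXT;SB_BJ;SR_R;z;z_helio;obsrun;quality;abemma;Z_ABS;KBESTR;R_CRCOR;Z_EMI;NMBEST;SNR;ETA_TYPE
1;2;TGS436Z001;349;00:11:55.72;-32:32:55.2;00:14:27.05;-32:16:14.6;19.424;19.362;19.430;19.390;0.062;19.368;18.286;0.2981;0.2981;01SEP;4;1;0.2981;5;4.5700;0.2984;1;3.8;-99.90000
2;1;TGS496Z001;349;00:11:59.29;-33:14:41.3;00:14:30.55;-32:58:00.7;18.842;18.789;18.870;18.840;0.053;18.688;17.291;0.1229;0.1228;01OCT;5;1;0.1229;1;14.3800;-9.9990;0;47.6;-2.58920
3;1;TGS435Z001;349;00:11:49.37;-32:39:57.4;00:14:20.71;-32:23:16.8;18.320;18.265;18.350;18.310;0.055;18.336;17.138;0.1038;0.1038;01SEP;4;1;0.1038;1;9.3800;0.1032;1;28.4;-2.46500
"""

# Wrap the text in an in-memory file object and parse it as ';'-separated CSV.
df = pd.read_csv(io.StringIO(data_sample), sep=";")

print(df.shape)  # (3, 27): three rows, twenty-seven columns
```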

The corresponding df has the following dtypes:

>>> df.dtypes
serial        int64
spectra       int64
name         object
UKST          int64
ra           object
dec          object
ra2000       object
dec2000      object
BJG         float64
BJSEL       float64
BJG_OLD     float64
BJSELOLD    float64
GALEXT      float64
SB_BJ       float64
SR_R        float64
z           float64
z_helio     float64
obsrun       object
quality       int64
abemma        int64
Z_ABS       float64
KBESTR        int64
R_CRCOR     float64
Z_EMI       float64
NMBEST        int64
SNR         float64
ETA_TYPE    float64
dtype: object

All I want to do is filter the column names by their data types; in particular, I want to keep the numeric columns. So I thought all I had to do was check whether each column's dtype was numpy.number:

>>> import numpy
>>> filter(lambda c: df[c].dtypes == numpy.number, df.columns)
['BJG',
'BJSEL',
'BJG_OLD',
'BJSELOLD',
'GALEXT',
'SB_BJ',
'SR_R',
'z',
'z_helio',
'Z_ABS',
'R_CRCOR',
'Z_EMI',
'SNR',
'ETA_TYPE']

but as we can see, all I get are the float columns; the int ones are left behind.

I do get the result I want by doing:

>>> filter(lambda c:df[c].dtypes == numpy.floating or df[c].dtypes == numpy.integer, df.columns)
['serial',
'spectra',
'UKST',
'BJG',
'BJSEL',
'BJG_OLD',
'BJSELOLD',
'GALEXT',
'SB_BJ',
'SR_R',
'z',
'z_helio',
'quality',
'abemma',
'Z_ABS',
'KBESTR',
'R_CRCOR',
'Z_EMI',
'NMBEST',
'SNR',
'ETA_TYPE']

(Note: using either numpy.floating or numpy.number in the line above gives the same result.)

The question here is: isn't numpy.number expected to "represent" any numerical type in numpy (int, float, complex, etc.)? After reading about the class hierarchy in the numpy.core.numerictypes help pages, the behavior shown above is unexpected to me. Does anybody have a comment on that? Am I missing something?

Cheers.
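
For reference, a subclass check against the hierarchy mentioned in the question can be written with numpy.issubdtype instead of ==. The sketch below uses only plain NumPy dtypes; the dtype names in the tuple are picked for illustration. The same call could in principle be applied per column as numpy.issubdtype(df[c].dtype, numpy.number), assuming every column has a NumPy-backed dtype (pandas extension dtypes may not be accepted by NumPy).

```python
import numpy as np

# Dtype names chosen for illustration: three numeric kinds and one non-numeric.
dtype_names = ("int64", "float64", "complex128", "object")

# issubdtype asks "is this dtype at or below np.number in the type hierarchy?"
# rather than "is this dtype equal to np.number?".
checks = {name: np.issubdtype(np.dtype(name), np.number) for name in dtype_names}

for name, is_numeric in checks.items():
    print(name, is_numeric)
# int64 True
# float64 True
# complex128 True
# object False
```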

1 Answer:

Answer 0: (score: 0)

Use select_dtypes and pass np.number inside a list:

In [160]:
df.select_dtypes([np.number]).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 21 columns):
serial      3 non-null int64
spectra     3 non-null int64
UKST        3 non-null int64
BJG         3 non-null float64
BJSEL       3 non-null float64
BJG_OLD     3 non-null float64
BJSELOLD    3 non-null float64
GALEXT      3 non-null float64
SB_BJ       3 non-null float64
SR_R        3 non-null float64
z           3 non-null float64
z_helio     3 non-null float64
quality     3 non-null int64
abemma      3 non-null int64
Z_ABS       3 non-null float64
KBESTR      3 non-null int64
R_CRCOR     3 non-null float64
Z_EMI       3 non-null float64
NMBEST      3 non-null int64
SNR         3 non-null float64
ETA_TYPE    3 non-null float64
dtypes: float64(14), int64(7)
memory usage: 528.0 bytes
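
Since the question asks for the column names rather than a filtered frame, the names can be read off the result of select_dtypes. This is a minimal sketch on a hypothetical three-column frame (df_demo stands in for the real table, reusing a few values from the sample); it assumes the include keyword of DataFrame.select_dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real table: one int, one string, one float column.
df_demo = pd.DataFrame(
    {
        "serial": [1, 2, 3],
        "name": ["TGS436Z001", "TGS496Z001", "TGS435Z001"],
        "z": [0.2981, 0.1229, 0.1038],
    }
)

# Keep every column whose dtype falls under np.number, then list the names.
numeric_columns = df_demo.select_dtypes(include=[np.number]).columns.tolist()

print(numeric_columns)  # ['serial', 'z']
```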