Pandas: why does the dtype of my column change?

Time: 2017-02-25 12:17:58

Tags: python pandas

Could someone please explain why, when I create a simple heterogeneous DataFrame with pandas, the dtypes change when I access each row individually?

For example:

import numpy as np
import pandas as pd

scene_df = pd.DataFrame({
    'magnitude': np.random.uniform(0.1, 0.3, (10,)),
    'x-center': np.random.uniform(-1, 1, (10,)),
    'y-center': np.random.uniform(-1, 1, (10,)),
    'label': np.random.randint(2, size=(10,), dtype='u1')})

scene_df.dtypes

prints:

label          uint8
magnitude    float64
x-center     float64
y-center     float64
dtype: object

But when I iterate over the rows:

[r['label'].dtype for i, r in scene_df.iterrows()]

I get float64 for the label:

[dtype('float64'),
 dtype('float64'),
 dtype('float64'),
 dtype('float64'),
 dtype('float64'),
...

Edit:

In reply to the question of what I intend to do with this:

import matplotlib.pyplot as plt

def square(mag, x, y):
    wh = np.array([mag, mag])
    pos = np.array((x, y)) - wh/2
    return plt.Rectangle(pos, *wh)

def circle(mag, x, y):
    return plt.Circle((x, y), mag)

shape_fn_lookup = [square, circle]

which ends up as this ugly piece of code:

[shape_fn_lookup[int(s['label'])](
        *s[['magnitude', 'x-center', 'y-center']])
 for i, s in scene_df.iterrows()]

This gives me a bunch of circles and squares that I could then plot:

[<matplotlib.patches.Circle at 0x7fcf3ea00d30>,
 <matplotlib.patches.Circle at 0x7fcf3ea00f60>,
 <matplotlib.patches.Rectangle at 0x7fcf3eb4da90>,
 <matplotlib.patches.Circle at 0x7fcf3eb4d908>,
...
]

Even DataFrame.to_dict('records') performs this dtype conversion:

type(scene_df.to_dict('records')[0]['label'])

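2 answers:

Answer 0 (score: 1)

Because iterrows() returns a Series for each row, whose index is made up of the column names.

A pandas.Series has only one dtype, so the row's values are upcast to float64: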

In [163]: first_row = list(scene_df.iterrows())[0][1]

In [164]: first_row
Out[164]:
label        0.000000
magnitude    0.293681
x-center    -0.628142
y-center    -0.218315
Name: 0, dtype: float64   # <--------- NOTE

In [165]: type(first_row)
Out[165]: pandas.core.series.Series

In [158]: [(type(r), r.dtype) for i, r in scene_df.iterrows()]
Out[158]:
[(pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64')),
 (pandas.core.series.Series, dtype('float64'))]
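The upcast can also be seen without iterating at all. The following is a minimal sketch on a small made-up frame (the name df and its values are assumptions for illustration, not the question's data): the column keeps uint8, a single row comes back as float64, and NumPy's promotion rules give the same common dtype for this pair.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same mix of dtypes as in the question.
df = pd.DataFrame({
    "magnitude": np.array([0.25, 0.125]),
    "label": np.array([1, 0], dtype="u1"),
})

# Column-wise, each dtype is preserved.
column_dtype = df["label"].dtype            # uint8

# A row has to fit into one Series, so a single common dtype is chosen.
row = df.iloc[0]
row_dtype = row.dtype                       # float64

# NumPy's promotion rules give the same answer for these two dtypes.
promoted = np.result_type(np.uint8, np.float64)

# The value itself is intact; it is only stored as a float.
label_as_int = int(row["label"])            # 1

print(column_dtype, row_dtype, promoted, label_as_int)
```

Since the value survives the round trip, the int(...) cast in the question's code is safe for small integer labels like these; it is just noisy.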

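Answer 1 (score: 1)

I suggest using itertuples instead of iterrows, because iterrows returns a Series for each row, and a Series does not preserve dtypes across the row (in a DataFrame, dtypes are preserved per column).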

[type(r.label) for r in scene_df.itertuples()]

Output:

[numpy.uint8,
 numpy.uint8,
 numpy.uint8,
 numpy.uint8,
 numpy.uint8,
 numpy.uint8,
 numpy.uint8,
 numpy.uint8,
 numpy.uint8,
 numpy.uint8]
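Applied to the dispatch code from the question, this might look like the sketch below. Two things here are assumptions rather than part of the original: the square and circle functions are hypothetical stand-ins that return plain tuples so the example runs without matplotlib, and the frame holds two hand-picked rows. Because 'x-center' and 'y-center' are not valid Python identifiers, the sketch asks itertuples for plain tuples (name=None) and unpacks them by position instead of by attribute.

```python
import numpy as np
import pandas as pd

# Two hand-picked rows (hypothetical values) with the question's columns.
scene_df = pd.DataFrame({
    "magnitude": [0.125, 0.25],
    "x-center": [0.5, -0.5],
    "y-center": [0.0, 0.25],
    "label": np.array([1, 0], dtype="u1"),
})

# Hypothetical stand-ins for the question's matplotlib-based functions.
def square(mag, x, y):
    # Lower-left corner plus width and height.
    return ("square", x - mag / 2, y - mag / 2, mag, mag)

def circle(mag, x, y):
    # Centre plus radius.
    return ("circle", x, y, mag)

shape_fn_lookup = [square, circle]

# Select the columns in the order the loop unpacks them.
columns = ["label", "magnitude", "x-center", "y-center"]

# index=False drops the index; name=None yields plain tuples, which avoids
# any renaming of column names that are not valid identifiers.
shapes = [
    shape_fn_lookup[label](mag, x, y)
    for label, mag, x, y in scene_df[columns].itertuples(index=False, name=None)
]

print(shapes)
```

The label arrives as an integer, so it can index shape_fn_lookup directly. Whether that integer is a numpy.uint8 or a built-in int may depend on the pandas version, which is why the sketch does not rely on the exact type.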