Question

NumPy seems to make a distinction between the str and object types. For instance, I can do::

    >>> import pandas as pd
    >>> import numpy as np
    >>> np.dtype(str)
    dtype('S')
    >>> np.dtype(object)
    dtype('O')

where dtype('S') and dtype('O') correspond to str and object, respectively.

However, pandas seems to lack that distinction and coerces str to object::

    >>> df = pd.DataFrame({'a': np.arange(5)})
    >>> df.a.dtype
    dtype('int64')
    >>> df.a.astype(str).dtype
    dtype('O')
    >>> df.a.astype(object).dtype
    dtype('O')

Forcing the type to dtype('S') does not help either.

Is there any explanation for this behavior?

Answer 1

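NumPy's string dtypes aren't Python strings.

Therefore, pandas deliberately uses native Python strings, which require an object dtype.

First off, let me demonstrate a bit of what I mean by NumPy's strings being different: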

In [1]: import numpy as np
In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

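Now, 'x' is a NumPy string dtype (fixed-width, C-like strings) and 'y' is an array of native Python strings.

If we try to go beyond 7 characters, we'll see an immediate difference. The string dtype version will be truncated: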

In [4]: x[1] = 'a really really really long'
In [5]: x
Out[5]:
array(['Testing', 'a reall', 'string'],
      dtype='|S7')

While the object dtype version can be of arbitrary length:

In [6]: y[1] = 'a really really really long'

In [7]: y
Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Next, |S dtype strings can't hold Unicode properly, though there is a fixed-length Unicode string dtype as well. I'll skip an example for the moment.
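For readers who want to see the Unicode point anyway, here is a minimal sketch. It assumes Python 3 and a reasonably recent NumPy, where the 'S' dtype stores bytes and the 'U' dtype stores fixed-width Unicode text; the sample words and the variable names are only illustrative.

```python
import numpy as np

# 'S' is a fixed-width *byte* string dtype. Under Python 3, NumPy encodes
# str input as ASCII, so non-ASCII text is rejected instead of stored.
try:
    np.array(['café'], dtype='S4')
    bytes_ok = True
except UnicodeError:
    bytes_ok = False

# 'U' is the fixed-width Unicode dtype: it stores the text, but it is
# still fixed width, so longer values are silently truncated.
u = np.array(['café', 'naïve'], dtype='U5')
u[0] = 'abcdefgh'          # only the first 5 characters are kept

print(bytes_ok)            # whether the byte-string dtype accepted 'café'
print(u)                   # contents after the truncating assignment
print(u.dtype.itemsize)    # bytes reserved per element
```

So the Unicode dtype fixes the encoding problem but keeps the fixed-width, truncating behaviour shown for |S7 above.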

Finally, NumPy's strings are actually mutable, while Python strings are not. For example:

In [8]: z = x.view(np.uint8)
In [9]: z += 1
In [10]: x
Out[10]:
array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'],
      dtype='|S7')

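For all of these reasons, pandas chose not to allow C-like, fixed-length strings as a datatype. As you noticed, attempting to coerce a Python string into a fixed-width NumPy string won't work in pandas. Instead, it always uses native Python strings, which behave in a more intuitive way for most users.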
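As a rough check of that last point, the sketch below hands pandas a fixed-width NumPy string array and inspects what comes back. It assumes pandas and NumPy are installed; the exact dtype reported depends on the pandas version (older releases report object, while newer ones may report a dedicated string dtype), so the example only checks that the fixed-width dtype is not kept. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A fixed-width NumPy string array (7 characters per element).
fixed = np.array(['Testing', 'a', 'string'], dtype='U7')

# pandas does not keep the fixed-width dtype when it wraps the array.
s = pd.Series(fixed)
print(fixed.dtype, '->', s.dtype)

# Because the values are no longer fixed width, a longer string is
# stored in full instead of being truncated to 7 characters.
s.iloc[1] = 'a really really really long'
print(s.iloc[1])
```

The same assignment on `fixed` itself would be cut to 7 characters, which is the behaviour pandas is avoiding.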

pandas distinction between str and object types
