Question

I am trying to convert a UInt8 pandas Series to the new StringDtype.

I can do the following, described in this question, which predates the new string dtype:

import pandas as pd
int_series = pd.Series(range(20), dtype="UInt8")
obj_series = int_series.apply(str)

This gives me a Series of object dtype containing strings.

However, if I try to convert the Series to the new string dtype, I get an error:

>>> string_series = int_series.astype("string")
...
TypeError: data type not understood

Note that converting the Series to object first and then to the string dtype works:

int_series.apply(str).astype("string")

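How can I convert the int Series directly to string?

I am using pandas 1.0.3 on Python 3.7.6.

Update: I found this open issue on the pandas GitHub page, which describes exactly the same problem.

A comment on that issue points to another open issue, which covers the functionality needed to convert between different ExtensionArray types. That functionality is not available yet.

So the answer is that a direct conversion cannot be done right now, but it may be implemented in the future.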
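Since the direct cast may start working in a later pandas release, one way to cope with both situations is to try it first and fall back to the two-step route. This is a minimal sketch rather than an official recipe: the helper name `to_string_series` is made up for illustration, and the fallback is simply the `apply(str)` workaround shown above.

```python
import pandas as pd


def to_string_series(series):
    """Cast a Series to the "string" dtype, going through object dtype if needed.

    Hypothetical helper, not part of pandas.
    """
    try:
        # Direct cast; this is the call that raised TypeError on pandas 1.0.3.
        return series.astype("string")
    except (TypeError, ValueError):
        # Two-step workaround from the question. str(pd.NA) is the literal
        # text "<NA>", so missing values would not survive this path.
        return series.apply(str).astype("string")


int_series = pd.Series(range(20), dtype="UInt8")
string_series = to_string_series(int_series)
print(string_series.dtype)
```

Catching both `TypeError` and `ValueError` is a guess at which exceptions a given pandas version might raise; only the `TypeError` is confirmed by the traceback above.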

Answer 1

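This is explained in the docs, in the examples section:

Unlike object dtype arrays, StringArray doesn't allow non-string values

where the following example is shown: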

pd.array(['1', 1], dtype="string")

Traceback (most recent call last):
...
ValueError: StringArray requires an object-dtype ndarray of strings.

The only solution seems to be casting to object dtype, as you are already doing, and then converting to string.
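One caveat with the object-dtype route is that `str` also stringifies missing values. The sketch below is one possible way to go through plain Python strings while keeping missing entries as `pd.NA`. The helper name `to_string_keep_na` is hypothetical and not something pandas provides.

```python
import pandas as pd


def to_string_keep_na(series):
    """Build a "string" Series from any Series, keeping missing values as pd.NA.

    Hypothetical helper, not part of pandas.
    """
    # Stringify each element by hand so that missing values are passed
    # through as pd.NA instead of becoming the text "<NA>".
    values = [pd.NA if pd.isna(value) else str(value) for value in series]
    return pd.Series(values, index=series.index, name=series.name, dtype="string")


int_series = pd.Series([1, 2, None], dtype="UInt8")
string_series = to_string_keep_na(int_series)
print(string_series.isna().tolist())
```

The list comprehension is slower than a vectorised cast, but it only relies on `pd.isna`, `str`, and the `Series` constructor.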

This is also stated explicitly in the source code of StringArray, where you will see this warning at the top:

   .. warning::
       Currently, this expects an object-dtype ndarray
       where the elements are Python strings or :attr:`pandas.NA`.
       This may change without warning in the future. Use
       :meth:`pandas.array` with ``dtype="string"`` for a stable way of
       creating a `StringArray` from any sequence.

If you look at the validation step in _validate, you will see that it fails for arrays that do not hold strings:

def _validate(self):
    """Validate that we only store NA or strings."""
    if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
        raise ValueError("StringArray requires a sequence of strings or pandas.NA")
    if self._ndarray.dtype != "object":
        raise ValueError(
            "StringArray requires a sequence of strings or pandas.NA. Got "
            f"'{self._ndarray.dtype}' dtype instead."
        )

For the example in the question:

import numpy as np
from pandas._libs import lib

lib.is_string_array(np.array(range(20)), skipna=True)
# False
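To make the two checks in `_validate` easier to follow, here is a rough pure-Python stand-in. It is an approximation for illustration only, not how pandas implements `lib.is_string_array`; the function name `holds_only_strings` is made up, and it treats only `pd.NA` as an allowed missing value.

```python
import numpy as np
import pandas as pd


def holds_only_strings(values):
    """Return True if `values` is an object-dtype array of str or pd.NA only.

    Hypothetical helper that loosely mirrors the checks in _validate.
    """
    # Second check in _validate: the backing array must be object dtype.
    if values.dtype != object:
        return False
    # First check in _validate: every element is a string or a missing value.
    return all(isinstance(v, str) or v is pd.NA for v in values)


print(holds_only_strings(np.array(range(20))))                 # False: integer dtype
print(holds_only_strings(np.array(["0", "1"], dtype=object)))  # True: strings only
print(holds_only_strings(np.array(["0", 1], dtype=object)))    # False: mixed values
```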

Answer 2

Use numpy.string_ (note that this is NumPy's bytes type, not the new StringDtype):

import numpy as np
string_series = int_series.astype(np.string_)

pandas: converting an int Series to the new StringDtype
