Pandas: convert dtype 'object' to int

Question

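I have read an SQL query into pandas, and the values come in as dtype 'object' even though they are strings, dates and integers. I am able to convert the date 'object' to a pandas datetime dtype, but I get an error when trying to convert the strings and integers.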

Here is an example:

>>> import pandas as pd
>>> df = pd.read_sql_query('select * from my_table', conn)
>>> df
    id    date          purchase
 1  abc1  2016-05-22    1
 2  abc2  2016-05-29    0
 3  abc3  2016-05-22    2
 4  abc4  2016-05-22    0

>>> df.dtypes
 id          object
 date        object
 purchase    object
 dtype: object

Converting df['date'] to a datetime works:

>>> pd.to_datetime(df['date'])
 1  2016-05-22
 2  2016-05-29
 3  2016-05-22
 4  2016-05-22
 Name: date, dtype: datetime64[ns]

But I get an error when trying to convert df['purchase'] to an integer:

>>> df['purchase'].astype(int)
 ....
 pandas/lib.pyx in pandas.lib.astype_intsafe (pandas/lib.c:16667)()
 pandas/src/util.pxd in util.set_value_at (pandas/lib.c:67540)()

 TypeError: long() argument must be a string or a number, not 'java.lang.Long'

Note: I get a similar error when I try .astype('float').

And when trying to convert to a string, nothing seems to happen.

>>> df['id'].apply(str)
 1 abc1
 2 abc2
 3 abc3
 4 abc4
 Name: id, dtype: object

Answer 1

Documenting the answer that worked for me, based on the comment by @piRSquared.

I needed to convert to a string first, and then to an integer.

>>> df['purchase'].astype(str).astype(int)
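A minimal sketch of why the detour through `str` can help, assuming the column holds wrapper objects (like the `java.lang.Long` values in the traceback) that print as digits. `FakeLong` is a hypothetical stand-in invented for this illustration, not a real driver type.

```python
import pandas as pd


class FakeLong:
    """Hypothetical stand-in for a driver-specific integer wrapper."""

    def __init__(self, value):
        self._value = value

    def __str__(self):
        return str(self._value)


purchase = pd.Series([FakeLong(1), FakeLong(0), FakeLong(2)], dtype=object)

# Going through str first hands pandas plain digit strings that it can parse.
converted = purchase.astype(str).astype(int)
print(converted.tolist())
```

This only works when every value prints as a valid integer; missing values need a different approach.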

Answer 2

pandas >= 1.0

`convert_dtypes`

The (self-)accepted answer doesn't take into account the possibility of NaNs in object columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({
     'a': [1, 2, np.nan],
     'b': [True, False, np.nan]}, dtype=object)
df

     a      b
0    1   True
1    2  False
2  NaN    NaN

df['a'].astype(str).astype(int) # raises ValueError

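It chokes because the NaN is converted to the string "nan", and the subsequent attempt to coerce that to an integer fails. To avoid this issue, we can soft-convert columns to their corresponding nullable type using convert_dtypes: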

df.convert_dtypes()                                                        

      a      b
0     1   True
1     2  False
2  <NA>   <NA>

df.convert_dtypes().dtypes                                                 

a      Int64
b    boolean
dtype: object

If your data has junk text mixed in with the integers, you can use pd.to_numeric as an initial step:

s = pd.Series(['1', '2', '...'])
s.convert_dtypes()  # converts to string, which is not what we want

0      1
1      2
2    ...
dtype: string 

# coerces non-numeric junk to NaNs
pd.to_numeric(s, errors='coerce')

0    1.0
1    2.0
2    NaN
dtype: float64

# one final `convert_dtypes` call to convert to nullable int
pd.to_numeric(s, errors='coerce').convert_dtypes() 

0       1
1       2
2    <NA>
dtype: Int64

Answer 3

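My train data contains three features that are object dtype. Applying astype converts an object column to a numeric one, but before that you need to perform some preprocessing steps: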

train.dtypes

C12       object
C13       object
C14       object

train['C14'] = train.C14.astype(int)

train.dtypes

C12       object
C13       object
C14       int32

Answer 4

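I can't comment, so I'm posting this as an answer. It sits somewhere between @piRSquared/@cyril's solution and @cs95's:

As @cs95 noted, if your data contains NaNs or Nones, converting to string type will raise an error when you then try to convert to int.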

However, if your data consists of (numeric) strings, convert_dtypes will convert it to string type unless you use pd.to_numeric as suggested by @cs95 (potentially combined with df.apply()).
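As a rough sketch of that combination, assuming a small made-up frame of object columns (the column names and values here are illustrative only):

```python
import pandas as pd

df = pd.DataFrame(
    {"purchase": ["1", "0", None], "qty": ["3", "oops", "5"]},
    dtype=object,
)

# to_numeric runs column by column; errors="coerce" turns junk into NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# convert_dtypes then picks nullable integer dtypes where all values are integral.
result = numeric.convert_dtypes()
print(result.dtypes)
```

Note that `errors="coerce"` silently discards anything unparseable, so it is worth checking how many values became missing.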

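If your data consists only of numeric strings (including NaNs or Nones, but without any non-numeric "junk"), a possibly simpler alternative is to convert first to float and then to one of the nullable-integer extension dtypes provided by pandas (already present in version 0.24):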

df['purchase'].astype(float).astype('Int64')

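Note that this was recently discussed elsewhere (although the issue is currently closed without being resolved), and that for very large 64-bit integers you may have to convert explicitly to a type with enough precision, to avoid approximations during the conversion.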
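A small sketch of the approximation mentioned above, assuming a made-up value just past the range where float64 represents every integer exactly (2**53):

```python
import numpy as np
import pandas as pd

big = 2**53 + 1  # 9007199254740993 has no exact float64 representation
s = pd.Series([str(big), np.nan], dtype=object)

# The detour through float rounds the value before it becomes an integer.
via_float = s.astype(float).astype("Int64")

print(via_float[0])         # 9007199254740992
print(via_float[0] == big)  # False
```

For identifiers or other integers of this size, the float route silently changes the data, so it is only safe for values well below 2**53.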

Answer 5

df['col_name'] = pd.to_numeric(df['col_name'])

This is a better option.
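A short sketch of what this does on clean and on dirty input, using made-up values. By default `pd.to_numeric` raises on anything it cannot parse:

```python
import pandas as pd

df = pd.DataFrame({"col_name": ["1", "0", "2"]}, dtype=object)
df["col_name"] = pd.to_numeric(df["col_name"])
print(df["col_name"].dtype)

# With the default errors="raise", unparseable text raises a ValueError.
failed = False
try:
    pd.to_numeric(pd.Series(["1", "x"], dtype=object))
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)
```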

Answer 6

It's simple:

pd.factorize(df.purchase)[0]

Example:

labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])

labels
# array([0, 0, 1, 2, 0])

uniques
# array(['b', 'a', 'c'], dtype=object)
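One caveat worth illustrating: `pd.factorize` assigns category codes in order of first appearance, so it does not parse the numbers stored in the strings. A minimal sketch with made-up values, comparing it against `pd.to_numeric`:

```python
import pandas as pd

purchase = pd.Series(["10", "5", "10"], dtype=object)

# codes are positions into `uniques`, not the numeric values themselves
codes, uniques = pd.factorize(purchase)
print(codes.tolist())    # [0, 1, 0]

# to_numeric parses the actual numbers
parsed = pd.to_numeric(purchase)
print(parsed.tolist())   # [10, 5, 10]
```

So factorize fits when the column is really a label, not when the integer values matter.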

Answer 7

In my case, I had a df with mixed data:

df:
                     0   1   2    ...                  242                  243                  244
0   2020-04-22T04:00:00Z   0   0  ...          3,094,409.5         13,220,425.7          5,449,201.1
1   2020-04-22T06:00:00Z   0   0  ...          3,716,941.5          8,452,012.9          6,541,599.9
....

The floats are actually objects, but I need them to be real floats.

To fix it, referencing @AMC's comment above:

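def coerce_to_float(val):
    # Return val as a float when it parses cleanly; otherwise leave it unchanged.
    try:
        return float(val)
    except ValueError:
        return val

# Apply element-wise (DataFrame.applymap was renamed to DataFrame.map in pandas 2.1).
df = df.applymap(coerce_to_float)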
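Note that `float("3,094,409.5")` raises `ValueError`, so values with thousands separators like those shown above would pass through unchanged. A hedged sketch of one way to handle them, using a made-up two-column frame (the column names `t` and `v` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "t": ["2020-04-22T04:00:00Z", "2020-04-22T06:00:00Z"],
    "v": ["3,094,409.5", "3,716,941.5"],
})

# Strip the thousands separators as plain text, then parse the numbers.
df["v"] = pd.to_numeric(df["v"].str.replace(",", "", regex=False))
print(df["v"].tolist())
```

This assumes the comma is always a thousands separator; data using a decimal comma would need different handling.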

Answer 8

To change the data type and keep the change in the DataFrame, assign the converted column back, as follows:

ds["cat"] = pd.to_numeric(ds["cat"]) or ds["cat"] = ds["cat"].astype(int)

Answer 9

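Follow these steps:

1. Clean your file: open your data file in CSV format, check whether "?" appears in place of empty values, and delete all of them.

2. Drop the rows that contain missing values, e.g.: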

df.dropna(subset=["normalized-losses"], axis = 0 , inplace= True)

3. Now use astype to do the conversion:

df["normalized-losses"]=df["normalized-losses"].astype(int)

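Note: if you still get an error in your program, check your CSV file again: open it in Excel to see whether there is a "?" in the column you need, then delete it, save the file, and run your program again.

Comment "success!" if it works. :)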
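The steps above can also be done without editing the file by hand. A minimal sketch, assuming the placeholder is exactly "?" and using made-up CSV contents held in memory:

```python
import io

import pandas as pd

# Made-up sample data; "?" marks a missing value.
csv_text = "make,normalized-losses\naudi,164\nbmw,?\nvolvo,95\n"

# Step 1: treat "?" as missing while reading instead of deleting it manually.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Step 2: drop rows where the column is missing.
df = df.dropna(subset=["normalized-losses"])

# Step 3: the remaining values can now be cast to int.
df["normalized-losses"] = df["normalized-losses"].astype(int)
print(df["normalized-losses"].tolist())
```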

Answer 10

This was my data

## list of columns
l1 = ['PM2.5', 'PM10', 'TEMP', 'BP', ' RH', 'WS', 'CO', 'O3', 'Nox', 'SO2']

for i in l1:
    for j in range(0, 8431):  # rows = 8431
        # .at writes to the frame directly; chained df[i][j] = ... may not
        df.at[j, i] = int(df.at[j, i])

I suggest using this only for small amounts of data. The complexity of this code is O(n^2).
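For larger data, a vectorized cast avoids the Python-level loop entirely. A sketch under the assumption that every value in the listed columns parses as an integer, using a tiny made-up frame with two of the column names from above plus a hypothetical `site` column:

```python
import pandas as pd

cols = ["PM2.5", "PM10"]
df = pd.DataFrame(
    {"PM2.5": ["12", "30"], "PM10": ["40", "55"], "site": ["a", "b"]},
    dtype=object,
)

# Cast only the listed columns; other columns keep their dtype.
df = df.astype({c: int for c in cols})
print(df.dtypes)
```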

Answer 11

Convert the object column to a numeric type, for example int.
