Question

My task is to read data from Excel into a DataFrame. The data is a bit messy and needs cleaning, which I have done:

import re

import pandas as pd

df_1 = pd.read_excel(offers[0])
df_1 = df_1.rename(columns={'Наименование [Дата Файла: 29.05.2019 время: 10:29:42 ]':'good_name', 
                     'Штрихкод':'barcode', 
                     'Цена шт. руб.':'price',
                     'Остаток': 'balance'
                    })
df_1 = df_1[new_columns]
# I don't know why, but the code doesn't work unless NaN is replaced with another character
df_1.barcode = df_1.barcode.fillna('_')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to numeric
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)
df_1.head()

It returns the barcode column with dtype float64 (why?)

0    0.000000e+00
1    7.613037e+12
2    7.613037e+12
3    7.613034e+12
4    7.613035e+12
Name: barcode, dtype: float64

Then I tried to convert the column to integer.

df_1.barcode = df_1.barcode.astype(int)

But I keep getting silly negative numbers.

df_1.barcode[0:5]
0             0
1   -2147483648
2   -2147483648
3   -2147483648
4   -2147483648

Name: barcode, dtype: int32

Thanks to @Will and @micric, I finally found a solution.

df_1 = pd.read_excel(offers[0])
df_1 = df_1[new_columns]
# replacing NaN with 0, it'll help to convert the column explicitly to dtype integer
df_1.barcode = df_1.barcode.fillna('0')
# remove all non-numeric characters
df_1.barcode = df_1.barcode.apply(lambda row: re.sub('[^0-9]', '', row))
# convert str to integer
df_1.barcode = pd.to_numeric(df_1.barcode, downcast='integer')

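Summary:

pd.to_numeric turns NaN into float64, so a column that contains both NaN and non-NaN values ends up with dtype float64.
Check the size of the numbers you are dealing with. int32 has its limits: it covers -2**31 = -2147483648 to 2**31 - 1 = 2147483647 (2**32 = 4294967296 distinct values in total), so 13-digit barcodes do not fit. Guys, thanks a lot for your help!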
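
One way to combine both points is sketched below. It is not the asker's exact code: the sample DataFrame is a hypothetical stand-in for the Excel file, and the vectorised `.str.replace` plus the explicit `int64` cast are assumptions about what a more defensive version could look like.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data: messy strings and a missing value.
df = pd.DataFrame(
    {"barcode": ["7613037000001", "7613-034 000002", None, "n/a"]},
    dtype=object,
)

digits = (
    df["barcode"]
    .fillna("0")                         # missing value -> "0"
    .astype(str)                         # make sure every cell is a string
    .str.replace(r"\D", "", regex=True)  # drop every non-digit character
)

# Cells left with no digits at all are coerced to NaN, then replaced by 0.
# The explicit int64 cast keeps 13-digit barcodes from overflowing.
df["barcode"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype("int64")

print(df["barcode"].tolist())  # [7613037000001, 7613034000002, 0, 0]
print(df["barcode"].dtype)     # int64
```

Note that the sketch assumes the cells are strings or missing. A float cell would be rendered as something like "123.0" by `astype(str)`, and stripping the dot would then append a spurious zero.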

Answer 1

Many questions in one question.

So the dtype you were expecting...

pd.to_numeric(df_1.barcode, downcast='integer').fillna(0)

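pd.to_numeric with downcast='integer' would give you an integer; however, your data contains NaN, and pandas needs a float64 type to represent NaN. The .fillna(0) only runs after the conversion, so the column is already float64 by then.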
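
A minimal sketch of this behaviour, using made-up barcode strings rather than the asker's data. The last step, the nullable `Int64` dtype, is not mentioned in the thread; it is shown only as one possible alternative when missing values should stay missing.

```python
import pandas as pd

# Hypothetical sample: barcode strings with one missing value.
raw = pd.Series(["7613037000001", None, "7613034000002"], dtype=object)

# The missing value becomes NaN, which forces float64 despite the downcast.
as_float = pd.to_numeric(raw, downcast="integer")
print(as_float.dtype)  # float64

# Filling the gap before converting leaves only integers.
as_int = pd.to_numeric(raw.fillna("0"), downcast="integer")
print(as_int.dtype)  # int64

# Alternative: the nullable Int64 dtype keeps the gap as <NA>.
nullable = pd.to_numeric(raw).astype("Int64")
print(nullable.dtype)  # Int64
```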

Answer 2

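That number is the lower limit of a 32-bit integer. Your numbers are outside the int32 range you are trying to use, so the limit is returned instead (note that 2**32 = 4294967296, divided by 2, is 2147483648, which is the magnitude of the number you got).

You should use astype('int64') instead.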
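
The range check and the fix can be sketched as follows, with made-up barcode values. Whether plain `int` maps to a 32-bit or a 64-bit integer depends on the platform and NumPy version (older NumPy builds on Windows used 32 bits), which is why the sketch names `np.int64` explicitly.

```python
import numpy as np
import pandas as pd

info = np.iinfo(np.int32)
print(info.min, info.max)  # -2147483648 2147483647

# Hypothetical barcode values stored as floats, as in the question.
barcodes = pd.Series([0.0, 7613037000001.0, 7613034000002.0])
print(bool((barcodes > info.max).any()))  # True: too large for int32

# Naming the 64-bit type avoids relying on what plain int means locally.
as_int64 = barcodes.astype(np.int64)
print(as_int64.tolist())  # [0, 7613037000001, 7613034000002]
```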

Answer 3

I ran into the same problem as the OP. Using

astype(np.int64)

solved it for me; see the link here.

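I like this solution because it is consistent with how I usually change the type of a pandas column. Maybe someone could check the performance of these solutions.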

Pandas astype(int) applied to a float column returns negative numbers
