Question

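When merging two indexed dataframes on certain values using an 'outer' merge, Python/pandas automatically adds null (NaN) values to the fields it cannot match. This is normal behaviour, but it changes the data type, and you have to restate what data types the columns should have.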

fillna() or dropna() do not seem to preserve the data types immediately after the merge. Do I need a table structure in place?

Typically I would run NumPy's np.where(field.isnull() etc) but that means running it for every column.

Is there a workaround for this?
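For reference, the behaviour described in the question can be reproduced with a small sketch like the one below. The frame and column names (left, right, key, x, y) are made up for illustration and are not from the original post.

```python
import pandas as pd

# Two small frames with integer columns; key 1 and key 4 have no partner.
left = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"key": [2, 3, 4], "y": [200, 300, 400]})

merged = left.merge(right, on="key", how="outer")

# x and y were integers before the merge, but the unmatched rows
# introduce NaN, so both columns are upcast to float.
print(left.dtypes["x"], right.dtypes["y"])
print(merged.dtypes)
```

The key column keeps its integer dtype because every row has a key; only the columns that receive NaN are upcast.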

Answer 1

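I don't think there is any really elegant or efficient way to do it. You could do it by tracking the original dtypes and then casting the columns after the merge, like this: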

import pandas as pd

# all types are originally ints
df = pd.DataFrame({'a': [1]*10, 'b': [1, 2] * 5, 'c': range(10)})
df2 = pd.DataFrame({'e': [1, 1], 'd': [1, 2]})

# track the original dtypes
orig = df.dtypes.to_dict()
orig.update(df2.dtypes.to_dict())

# join the dataframe
joined = df.join(df2, how='outer')

# columns with nans are now float dtype
print(joined.dtypes)

# replace nans with suitable int value
joined.fillna(-1, inplace=True)

# re-cast the columns as their original dtype
joined_orig_types = joined.apply(lambda x: x.astype(orig[x.name]))

print(joined_orig_types.dtypes)

Answer 2

This should really only be an issue with bool or int dtypes. float, object and datetime64[ns] can already hold NaN or NaT without changing the type.
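A quick way to check this claim is to introduce a missing row into a frame with one column of each kind and compare the dtypes. This is only a sketch: the column names are invented, and reindex is used here as a stand-in for the unmatched rows an outer join produces.

```python
import pandas as pd

df = pd.DataFrame({
    "i": [1, 2],
    "b": [True, False],
    "f": [1.5, 2.5],
    "o": ["x", "y"],
    "t": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})

# Label 2 does not exist in df, so every column gets a missing value there,
# much like an unmatched row in an outer join.
widened = df.reindex([0, 1, 2])

# Compare dtypes side by side: only the int and bool columns change.
print(pd.DataFrame({"before": df.dtypes, "after": widened.dtypes}))
```

With the default NumPy-backed dtypes, the int column becomes float and the bool column becomes object, while the other three keep their dtype.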

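As such, I'd recommend using the new nullable integer type, Int64, which can store NaN, for int or bool columns. For the bools, this requires converting to 1 or 0 instead of True or False, and then to Int64. You should do this for all int and bool columns prior to the join, but I illustrate it only on df2, whose columns get NaN rows after the join: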

import pandas as pd

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})

df2 = df2.astype('int').astype('Int64')
df2.dtypes
#d    Int64
#e    Int64
#dtype: object

df.join(df2)
#   a  b  c    d    e
#0  1  1  0    1    1
#1  1  2  1    2    0
#2  1  1  2  NaN  NaN
#3  1  2  3  NaN  NaN
#4  1  1  4  NaN  NaN
#5  1  2  5  NaN  NaN

df.join(df2).dtypes
#a    int64
#b    int64
#c    int64
#d    Int64
#e    Int64
#dtype: object

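The benefit here is that nothing is downcast until it needs to be. For instance, in the other solutions, if you .fillna(-1.72) you might get an unwanted answer when int(-1.72) is called, which coerces the fill value to -1. This may be useful in some cases, but dangerous in others.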
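The truncation described above can be seen in a minimal sketch. The Series s is a made-up stand-in for an int column that became float after an outer join.

```python
import pandas as pd

# A float column with a gap, as an int column looks after an outer join.
s = pd.Series([1.0, None])

filled = s.fillna(-1.72)
as_int = filled.astype("int64")  # casting back to int truncates toward zero

print(filled.tolist())  # [1.0, -1.72]
print(as_int.tolist())  # [1, -1]
```

The cast back to a NumPy integer dtype succeeds silently, so the fill value that ends up in the data is -1, not the -1.72 that was requested.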

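With Int64 the fill value stays exactly what you specified, and the column is only upcast if you fill with a non-integer. It also won't raise an error if you do something like .fillna('Missing'), because it never tries to coerce the string to an int type.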

Answer 3

Or you can simply concat/append the dtypes of both dfs and apply astype() with the result.
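The code for this answer did not survive in the thread, so the following is only a sketch of what the idea might look like, not the author's original snippet. It assumes integer-only frames with non-overlapping column names and reuses the -1 fill value from the other answers.

```python
import pandas as pd

df = pd.DataFrame({"a": [1] * 6, "b": [1, 2] * 3, "c": range(6)})
df2 = pd.DataFrame({"e": [1, 1], "d": [1, 2]})

# One Series mapping every column name to its original dtype.
dtypes = pd.concat([df.dtypes, df2.dtypes])

# Fill the gaps, then cast every column back in a single astype call.
joined = df.join(df2, how="outer").fillna(-1).astype(dtypes)
print(joined.dtypes)
```

DataFrame.astype accepts a Series of column name to dtype, which is why the concatenated dtypes can be passed in directly.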

Answer 4

A simpler version of @hume's answer: grab the original types directly, then use astype to restore the dtypes in one shot. Here is the code:

orig = df.dtypes.to_dict()
orig.update(df2.dtypes.to_dict())
joined = df.join(df2, how='outer')
new_joined = joined.fillna(-1).astype(orig)
print(new_joined)
print(new_joined.dtypes)

Output:

   a  b  c  d  e
0  1  1  0  1  1
1  1  2  1  2  1
2  1  1  2 -1 -1
3  1  2  3 -1 -1
4  1  1  4 -1 -1
5  1  2  5 -1 -1
6  1  1  6 -1 -1
7  1  2  7 -1 -1
8  1  1  8 -1 -1
9  1  2  9 -1 -1
a    int64
b    int64
c    int32
d    int64
e    int64
dtype: object
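If this pattern is needed in several places, it could be wrapped in a small function. The sketch below is one possible shape: the name outer_join_keep_dtypes and its parameters are hypothetical, and it assumes the two frames share no column names and that fill_value is valid for every original dtype.

```python
import pandas as pd

def outer_join_keep_dtypes(left, right, fill_value):
    """Outer-join on the index, then restore the dtypes the join changed.

    Hypothetical helper; assumes no overlapping column names.
    """
    orig = {**left.dtypes.to_dict(), **right.dtypes.to_dict()}
    joined = left.join(right, how="outer")
    # Only fill and re-cast the columns whose dtype actually changed.
    for col in joined.columns:
        if joined[col].dtype != orig[col]:
            joined[col] = joined[col].fillna(fill_value).astype(orig[col])
    return joined

df = pd.DataFrame({"a": [1] * 10, "b": [1, 2] * 5, "c": range(10)})
df2 = pd.DataFrame({"e": [1, 1], "d": [1, 2]})

result = outer_join_keep_dtypes(df, df2, fill_value=-1)
print(result.dtypes)
```

Columns that never received NaN are left untouched, so the fill value only appears where the join created gaps.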

Answer 5

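As of pandas 1.0.0, I believe you have another option, which is to use convert_dtypes first. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issue with NaN. Unlike the answer that casts to Int64, this also preserves boolean values.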

...

import pandas as pd

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})
df = df.convert_dtypes()
df2 = df2.convert_dtypes()
print(df.join(df2))

#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>
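To confirm which dtypes the joined frame ends up with, and to get back to plain NumPy dtypes if a downstream step needs them, something like the following sketch could be used. The fill values (-1 and False) are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1] * 6, "b": [1, 2] * 3, "c": range(6)}).convert_dtypes()
df2 = pd.DataFrame({"d": [1, 2], "e": [True, False]}).convert_dtypes()

joined = df.join(df2)

# The nullable dtypes survive the join: d stays Int64 and e stays boolean.
print(joined.dtypes)

# Optional: fill the gaps explicitly and cast back to NumPy dtypes.
filled = joined.fillna({"d": -1, "e": False}).astype({"d": "int64", "e": "bool"})
print(filled.dtypes)
```

Because the missing values are represented by pd.NA, nothing is upcast during the join, and the decision about a fill value can be deferred until it is actually needed.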

Preserve Dataframe column data types after outer merge