Debugging pandas index differences

Time: 2018-05-17 15:38:38

Tags: python pandas join

I have two DataFrames that should be identical (the only difference is the column names; the indexes and values match):

df1
Out[300]: 
                         C1 2018-05-17 P1 2018-05-17
Symbol YYYY MM DD Strike                            
AA     2018 05 18 29.0               0             0
                  30.0               0             0

df2
Out[301]: 
                         C 2018-05-17 P 2018-05-17
Symbol YYYY MM DD Strike                          
AA     2018 05 18 29.0              0            0
                  30.0              0            0

When I try to join them, pandas does not match the indexes:
df1.join(df2,how='outer')
Out[302]: 
                       C1 2018-05-17 P1 2018-05-17 C 2018-05-17 P 2018-05-17
Symbol YYYY MM DD Strike
AA     2018 05 18 29.0               0             0          NaN          NaN
                  30.0               0             0          NaN          NaN
                  29.0             NaN           NaN            0            0
                  30.0             NaN           NaN            0            0

It seems that the Strike level is not being treated as a match. How can I figure out what the difference is?

df1.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2 entries, (AA, 2018, 05, 18, 29.0) to (AA, 2018, 05, 18, 30.0)
Data columns (total 2 columns):
C1 2018-05-17    2 non-null object
P1 2018-05-17    2 non-null object
dtypes: object(2)
memory usage: 48.3+ KB

df2.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2 entries, (AA, 2018, 05, 18, 29.0) to (AA, 2018, 05, 18, 30.0)
Data columns (total 2 columns):
C 2018-05-17    2 non-null object
P 2018-05-17    2 non-null object
dtypes: object(2)
memory usage: 7.5+ KB

Update

I found that one of the Strike columns is of type float:
df1 = df1.reset_index()

df2 = df2.reset_index()

df1.dtypes
Out[346]: 
Symbol            object
YYYY              object
MM                object
DD                object
Strike           float64
C1 2018-05-17     object
P1 2018-05-17     object
dtype: object

df2.dtypes
Out[347]: 
Symbol          object
YYYY            object
MM              object
DD              object
Strike          object
C 2018-05-17    object
P 2018-05-17    object
dtype: object

However, even if I change the dtype to object:

df1 = df1.reset_index()

df1.Strike = df1.Strike.astype('object')

df1.dtypes
Out[360]: 
level_0           int64
index            object
Symbol           object
YYYY             object
MM               object
DD               object
Strike           object
C1 2018-05-17    object
P1 2018-05-17    object
dtype: object

as soon as I set it back as the index, it changes back to float:

df1.set_index(['Symbol','YYYY','MM','DD','Strike']).reset_index().dtypes
Out[373]: 
Symbol            object
YYYY              object
MM                object
DD                object
Strike           float64
C1 2018-05-17     object
P1 2018-05-17     object
dtype: object

How can I stop it from changing back?

2 answers:

Answer 0 (score: 0)

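I was able to reproduce your problem by using strings for Strike in one index and ints in the other. My guess is that the Strike level has a different type in each of your DataFrames: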

import numpy as np
import pandas as pd

tuples1 = [('AA', '2018', '05', '18', '29'), ('AA', '2018', '05', '18', '30')]
index1 = pd.MultiIndex.from_tuples(tuples1, names=('Symbol', 'YYYY', 'MM', 'DD', 'Strike'))

tuples2 = [('AA', '2018', '05', '18', 29), ('AA', '2018', '05', '18', 30)]
index2 = pd.MultiIndex.from_tuples(tuples2, names=('Symbol', 'YYYY', 'MM', 'DD', 'Strike'))

df1 = pd.DataFrame(np.random.rand(2,2), index=index1, columns=['A','B'])
df2 = pd.DataFrame(np.random.rand(2, 2), index=index2, columns=['C', 'D'])

print(df1)
print(df2)

print(df1.join(df2, how='outer'))

Output:

                                 A         B         C         D
Symbol YYYY MM DD Strike                                        
AA     2018 05 18 29      0.891830  0.670130       NaN       NaN
                  30      0.126326  0.921279       NaN       NaN
                  29           NaN       NaN  0.962292  0.822756
                  30           NaN       NaN  0.478753  0.559231

If you try:

print(index1.get_level_values(4))
print(index2.get_level_values(4))

you will see that the two levels have different dtypes:

Index(['29', '30'], dtype='object', name='Strike')
Int64Index([29, 30], dtype='int64', name='Strike')
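The same check can be run over every index level at once. The following is only a minimal sketch: the helper names `level_dtypes` and `mismatched_levels` and the small sample frames are made up for illustration and are not part of pandas.

```python
import pandas as pd


def level_dtypes(df):
    """Map each index level name to the dtype of that level's values."""
    return {name: df.index.get_level_values(name).dtype
            for name in df.index.names}


def mismatched_levels(left, right):
    """List the index level names whose dtypes differ between two frames."""
    left_types = level_dtypes(left)
    right_types = level_dtypes(right)
    return [name for name in left.index.names
            if name not in right_types or left_types[name] != right_types[name]]


names = ["Symbol", "YYYY", "MM", "DD", "Strike"]

# Illustrative data: Strike is stored as strings here ...
left = pd.DataFrame(
    {"A": [1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("AA", "2018", "05", "18", "29"), ("AA", "2018", "05", "18", "30")],
        names=names,
    ),
)
# ... and as integers here.
right = pd.DataFrame(
    {"C": [3, 4]},
    index=pd.MultiIndex.from_tuples(
        [("AA", "2018", "05", "18", 29), ("AA", "2018", "05", "18", 30)],
        names=names,
    ),
)

print(level_dtypes(left))
print(level_dtypes(right))
print(mismatched_levels(left, right))
```

Any level reported by `mismatched_levels` is a candidate for the reason the join fails to align rows that look the same when printed.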

If you do this:

df1.Strike = df1.Strike.astype('object')

then you get:

Symbol     object
YYYY       object
MM         object
DD         object
Strike    float64
C         float64
D         float64

Instead, use:

df1.Strike = df1.Strike.astype(str)

which gives:

Symbol     object
YYYY       object
MM         object
DD         object
Strike     object
C         float64
D         float64

Finally:

print(df1.join(df2, how='outer'))

Output:

                                 A         B         C         D
Symbol YYYY MM DD Strike                                        
AA     2018 05 18 29      0.755093  0.256132  0.291880  0.404898
                  30      0.827709  0.254511  0.849849  0.605643

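Of course, this will break if you end up comparing the string '30' with the string '30.0', so it is probably better to convert the strings to floats rather than the other way around.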
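Converting Strike to float on both sides before joining could look roughly like this. It is a sketch under assumptions: the helper `with_float_strike` and the sample frames are hypothetical, and it assumes every Strike value can be parsed as a number.

```python
import pandas as pd

NAMES = ["Symbol", "YYYY", "MM", "DD", "Strike"]


def with_float_strike(df, level="Strike"):
    """Return a copy of df whose `level` index values are floats."""
    names = list(df.index.names)
    flat = df.reset_index()
    # '29', '29.0' and 29 all become the float 29.0
    flat[level] = flat[level].astype(float)
    return flat.set_index(names)


# Illustrative data: one frame stores Strike as strings (in mixed formats),
# the other as integers.
strings = pd.DataFrame(
    {"A": [1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("AA", "2018", "05", "18", "29.0"), ("AA", "2018", "05", "18", "30")],
        names=NAMES,
    ),
)
ints = pd.DataFrame(
    {"C": [3, 4]},
    index=pd.MultiIndex.from_tuples(
        [("AA", "2018", "05", "18", 29), ("AA", "2018", "05", "18", 30)],
        names=NAMES,
    ),
)

joined = with_float_strike(strings).join(with_float_strike(ints), how="outer")
print(joined)
```

Going through float avoids the '30' versus '30.0' mismatch, because both spellings parse to the same number.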

Answer 1 (score: 0)

This is a poor answer, but it works; I don't know why.

If I write the DataFrame to a CSV file and then read it back, I can set the dtypes successfully:

df1.to_csv(r'*.csv')
df1 = pd.read_csv(r'*.csv', dtype = 'str')
df1 = df1.set_index(['Symbol','YYYY','MM','DD','Strike'])
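A plausible explanation is that `dtype='str'` turns every column, Strike included, into strings, and strings are not converted back to float when they are moved into the index. If that is the reason, the same effect should be achievable in memory without a file. The sketch below is an assumption-laden illustration: `with_string_index` is a made-up helper, and unlike the CSV round trip it casts only the index levels, not the data columns.

```python
import pandas as pd

NAMES = ["Symbol", "YYYY", "MM", "DD", "Strike"]


def with_string_index(df):
    """Rebuild df's index with every level cast to str, without touching disk."""
    names = list(df.index.names)
    flat = df.reset_index()
    flat[names] = flat[names].astype(str)  # the float 29.0 becomes '29.0'
    return flat.set_index(names)


# Illustrative data: Strike is float in one frame and string in the other.
floats = pd.DataFrame(
    {"C1": [0, 0]},
    index=pd.MultiIndex.from_tuples(
        [("AA", "2018", "05", "18", 29.0), ("AA", "2018", "05", "18", 30.0)],
        names=NAMES,
    ),
)
texts = pd.DataFrame(
    {"C": [0, 0]},
    index=pd.MultiIndex.from_tuples(
        [("AA", "2018", "05", "18", "29.0"), ("AA", "2018", "05", "18", "30.0")],
        names=NAMES,
    ),
)

joined = with_string_index(floats).join(with_string_index(texts), how="outer")
print(joined)
```

This only lines up when both sides spell the number the same way ('29.0' on both), so converting to float is usually the more robust choice.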