I have two identical DataFrames (the only difference is the column names; the indexes and values match)
df1
Out[300]:
                          C1 2018-05-17  P1 2018-05-17
Symbol YYYY MM DD Strike
AA     2018 05 18 29.0                0              0
                  30.0                0              0
df2
Out[301]:
                          C 2018-05-17  P 2018-05-17
Symbol YYYY MM DD Strike
AA     2018 05 18 29.0               0             0
                  30.0               0             0
When I try to join them, pandas does not match the indexes
df1.join(df2,how='outer')
Out[302]:
                          C1 2018-05-17  P1 2018-05-17  C 2018-05-17  P 2018-05-17
Symbol YYYY MM DD Strike
AA     2018 05 18 29.0                0              0           NaN           NaN
                  30.0                0              0           NaN           NaN
                  29.0              NaN            NaN             0             0
                  30.0              NaN            NaN             0             0
It looks like the "Strike" level is not being treated as a match. How can I figure out what the difference is?
df1.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2 entries, (AA, 2018, 05, 18, 29.0) to (AA, 2018, 05, 18, 30.0)
Data columns (total 2 columns):
C1 2018-05-17 2 non-null object
P1 2018-05-17 2 non-null object
dtypes: object(2)
memory usage: 48.3+ KB
df2.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2 entries, (AA, 2018, 05, 18, 29.0) to (AA, 2018, 05, 18, 30.0)
Data columns (total 2 columns):
C 2018-05-17 2 non-null object
P 2018-05-17 2 non-null object
dtypes: object(2)
memory usage: 7.5+ KB
Update
I found that one of the Strike columns is of type float
df1 = df1.reset_index()
df2 = df2.reset_index()
df1.dtypes
Out[346]:
Symbol object
YYYY object
MM object
DD object
Strike float64
C1 2018-05-17 object
P1 2018-05-17 object
dtype: object
df2.dtypes
Out[347]:
Symbol object
YYYY object
MM object
DD object
Strike object
C 2018-05-17 object
P 2018-05-17 object
dtype: object
However, even if I change the dtype to object
df1 = df1.reset_index()
df1.Strike = df1.Strike.astype('object')
df1.dtypes
Out[360]:
level_0 int64
index object
Symbol object
YYYY object
MM object
DD object
Strike object
C1 2018-05-17 object
P1 2018-05-17 object
dtype: object
if I set it back as the index, it changes back to float
df1.set_index(['Symbol','YYYY','MM','DD','Strike']).reset_index().dtypes
Out[373]:
Symbol object
YYYY object
MM object
DD object
Strike float64
C1 2018-05-17 object
P1 2018-05-17 object
dtype: object
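How do I stop it from changing back?
Answer 0: (score: 0)
I was able to recreate your problem by using strings for one index and ints for the other. My guess is that the Strike level has a different type in each of your frames:
import numpy as np
import pandas as pd
tuples1 = [('AA', '2018', '05', '18', '29'), ('AA', '2018', '05', '18', '30')]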
index1 = pd.MultiIndex.from_tuples(tuples1, names=('Symbol', 'YYYY', 'MM', 'DD', 'Strike'))
tuples2 = [('AA', '2018', '05', '18', 29), ('AA', '2018', '05', '18', 30)]
index2 = pd.MultiIndex.from_tuples(tuples2, names=('Symbol', 'YYYY', 'MM', 'DD', 'Strike'))
df1 = pd.DataFrame(np.random.rand(2,2), index=index1, columns=['A','B'])
df2 = pd.DataFrame(np.random.rand(2, 2), index=index2, columns=['C', 'D'])
print(df1)
print(df2)
print(df1.join(df2, how='outer'))
Output:
                                 A         B         C         D
Symbol YYYY MM DD Strike
AA     2018 05 18 29      0.891830  0.670130       NaN       NaN
                  30      0.126326  0.921279       NaN       NaN
                  29           NaN       NaN  0.962292  0.822756
                  30           NaN       NaN  0.478753  0.559231
If you try:
print(index1.get_level_values(4))
print(index2.get_level_values(4))
then you will see that they have different data types:
Index(['29', '30'], dtype='object', name='Strike')
Int64Index([29, 30], dtype='int64', name='Strike')
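To check every level at once rather than only level 4, you could collect the dtype of each index level for both frames and compare them. This is only a minimal sketch: `level_dtypes` and `mismatched_levels` are hypothetical helper names (not pandas functions), and the sample frames merely mimic the shape of the ones above.

```python
import pandas as pd


def level_dtypes(df):
    """Return {level name: dtype} for every level of df's index."""
    return {
        name: df.index.get_level_values(pos).dtype
        for pos, name in enumerate(df.index.names)
    }


def mismatched_levels(left, right):
    """List the index levels whose dtypes differ between the two frames.

    Assumes both frames use the same level names.
    """
    left_types = level_dtypes(left)
    right_types = level_dtypes(right)
    return [name for name in left_types if left_types[name] != right_types[name]]


names = ["Symbol", "YYYY", "MM", "DD", "Strike"]
# Strike stored as strings in one index and as ints in the other
idx_str = pd.MultiIndex.from_tuples(
    [("AA", "2018", "05", "18", "29"), ("AA", "2018", "05", "18", "30")], names=names
)
idx_int = pd.MultiIndex.from_tuples(
    [("AA", "2018", "05", "18", 29), ("AA", "2018", "05", "18", 30)], names=names
)
left = pd.DataFrame({"A": [1.0, 2.0]}, index=idx_str)
right = pd.DataFrame({"C": [3.0, 4.0]}, index=idx_int)

print(mismatched_levels(left, right))
```

For these sample frames only the Strike level is reported, which is the level the join fails to align on.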
If you do this
df1.Strike = df1.Strike.astype('object')
then you get:
Symbol object
YYYY object
MM object
DD object
Strike float64
C float64
D float64
Instead, do this:
df1.Strike = df1.Strike.astype(str)
which gives:
Symbol object
YYYY object
MM object
DD object
Strike object
C float64
D float64
Finally:
print(df1.join(df2, how='outer'))
Output:
                                 A         B         C         D
Symbol YYYY MM DD Strike
AA     2018 05 18 29      0.755093  0.256132  0.291880  0.404898
                  30      0.827709  0.254511  0.849849  0.605643
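Of course, this breaks down if you end up comparing the string '30' with the string '30.0', so it is better to convert the strings to float rather than the other way round.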
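A rough sketch of that float conversion, assuming the level is called Strike and every value in it parses as a number. `with_float_level` is a hypothetical helper name, and the sample data is made up to include the '30.0' versus 30 case:

```python
import pandas as pd


def with_float_level(df, level="Strike"):
    """Return a copy of df whose index level `level` is float64."""
    names = list(df.index.names)
    flat = df.reset_index()
    # to_numeric parses strings such as '30' and '30.0' to the same number
    flat[level] = pd.to_numeric(flat[level]).astype("float64")
    return flat.set_index(names)


names = ["Symbol", "YYYY", "MM", "DD", "Strike"]
# One frame has Strike as strings (note '30.0'), the other as ints
idx_str = pd.MultiIndex.from_tuples(
    [("AA", "2018", "05", "18", "29"), ("AA", "2018", "05", "18", "30.0")], names=names
)
idx_int = pd.MultiIndex.from_tuples(
    [("AA", "2018", "05", "18", 29), ("AA", "2018", "05", "18", 30)], names=names
)
left = pd.DataFrame({"A": [0.1, 0.2], "B": [0.3, 0.4]}, index=idx_str)
right = pd.DataFrame({"C": [0.5, 0.6], "D": [0.7, 0.8]}, index=idx_int)

joined = with_float_level(left).join(with_float_level(right), how="outer")
print(joined)
```

Both frames are normalised before the join, so it does not matter which of them held the strings. The join then aligns on two rows instead of producing four.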
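Answer 1: (score: 0)
This is a poor answer, but it works, and I don't know why.
If I write the DataFrame to a CSV and then read it back in, I can successfully set the dtype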
df1.to_csv(r'*.csv')
df1 = pd.read_csv(r'*.csv', dtype = 'str')
df1 = df1.set_index(['Symbol','YYYY','MM','DD','Strike'])
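The round trip presumably works because `read_csv` with `dtype='str'` hands back every column as strings, so `set_index` has nothing numeric to work with. A sketch of the same idea without touching the disk, assuming string index levels are acceptable; `stringify_index` is a hypothetical helper name:

```python
import pandas as pd


def stringify_index(df):
    """Return a copy of df whose index levels all hold strings."""
    names = list(df.index.names)
    flat = df.reset_index()
    # Cast only the former index columns; data columns keep their dtypes
    flat = flat.astype({name: str for name in names})
    return flat.set_index(names)


names = ["Symbol", "YYYY", "MM", "DD", "Strike"]
idx = pd.MultiIndex.from_tuples(
    [("AA", "2018", "05", "18", 29.0), ("AA", "2018", "05", "18", 30.0)], names=names
)
df1 = pd.DataFrame({"C1 2018-05-17": [0, 0], "P1 2018-05-17": [0, 0]}, index=idx)

df1 = stringify_index(df1)
print(list(df1.index.get_level_values("Strike")))
```

A float strike becomes the string '29.0', so the other frame has to spell its strikes the same way; this is the '30' versus '30.0' caveat from the first answer.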