I don't know what is going on here; the title is only a first-order approximation. I am trying to join two data frames:
>>> df_sum.head()
TUCASEID t070101 t070102 t070103 t070104 t070105 t070199 \
0 20030100013280 0 0 0 0 0 0
1 20030100013344 0 0 0 0 0 0
2 20030100013352 60 0 0 0 0 0
3 20030100013848 0 0 0 0 0 0
4 20030100014165 0 0 0 0 0 0
t070201 t070299 shopping year
0 0 0 0 2003
1 0 0 0 2003
2 0 0 60 2003
3 0 0 0 2003
4 0 0 0 2003
>>> emp.head()
TUCASEID status
0 20030100013280 emp
1 20030100013344 emp
2 20030100013352 emp
4 20030100014165 emp
5 20030100014169 emp
Those are the data frames, and I want to join them on the common column TUCASEID, whose values do intersect:
>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID)
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462,
20131212132469, 20131212132475])
Now...
>>> df_sum.join(emp, on='TUCASEID', how='inner')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join
rsuffix=rsuffix, sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result
rdata.items, rsuf)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix
to_rename)
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object')
Hmm, that's odd, since the only column that appears in both data frames is the one being joined on, but fine, let's oblige [1]:
>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r')
Empty DataFrame
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status]
Index: []
The result is empty even though there is a huge intersection. What is going on?
>>> pd.__version__
'0.15.0'
[1]: I actually forced the dtype of the join column to integer, because the error message says "object" there. It made no difference:
>>> emp.dtypes
TUCASEID int64
status object
dtype: object
>>> df_sum.dtypes
TUCASEID int64
(...)
shopping int64
year int64
dtype: object
Answer 0 (score: 2)
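df.join generally calls pd.merge (except in a special case where it calls concat). So anything join can do, merge can do as well. While perhaps not strictly correct, I tend to use df.join only when joining on the index, and pd.merge when joining on columns.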
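As a minimal sketch of that rule of thumb, the snippet below joins the same two toy frames once on a column with pd.merge and once on the index with DataFrame.join. The frames left and right and their column names are hypothetical, made up for illustration only.

```python
import pandas as pd

# Hypothetical toy frames that share a "key" column.
left = pd.DataFrame({"key": [101, 102, 103], "a": [1, 2, 3]})
right = pd.DataFrame({"key": [102, 103, 104], "b": [20, 30, 40]})

# Column-to-column: pd.merge matches the values of "key" in both frames.
by_column = pd.merge(left, right, on="key", how="inner")

# Index-to-index: DataFrame.join matches index labels,
# so the key is moved into the index of both frames first.
by_index = left.set_index("key").join(right.set_index("key"), how="inner")

print(by_column)
print(by_index)
```

Both results contain the two shared keys (102 and 103); the difference is only whether the key ends up as a regular column or as the index.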
With that in mind, I can reproduce the problem you describe:
import numpy as np
import pandas as pd
df_sum = pd.DataFrame(np.arange(6*2).reshape((6,2)),
index=list('ABCDEF'), columns=list('XY'))
emp = pd.DataFrame(np.arange(6*2).reshape((6,2)),
index=list('ABCDEF'), columns=list('XZ'))
print(df_sum.join(emp, on='X', rsuffix='_r', how='inner'))
# Empty DataFrame
# Columns: [X, Y, X_r, Z]
# Index: []
But pd.merge works as expected, with no need to supply rsuffix:
print(pd.merge(df_sum, emp, on='X'))
yields
X Y Z
0 0 1 1
1 2 3 3
2 4 5 5
3 6 7 7
4 8 9 9
5 10 11 11
Under the hood, df_sum.join calls merge this way:
if isinstance(other, DataFrame):
return merge(self, other, left_on=on, how=how,
left_index=on is None, right_index=True,
suffixes=(lsuffix, rsuffix), sort=sort)
So even though you are using df_sum.join(emp, on='...'), Pandas converts it to pd.merge(df_sum, emp, left_on='...', right_index=True).
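A small sketch of that equivalence, assuming a reasonably recent pandas: the column named in on= is matched against the other frame's index, exactly as in a merge with left_on= and right_index=True. The frames and names here are hypothetical, and integer keys are used so that the column and the index have comparable dtypes.

```python
import pandas as pd

# Hypothetical toy data: the right frame carries its keys in the index.
left = pd.DataFrame({"key": [101, 102, 103], "a": [1, 2, 3]})
right = pd.DataFrame({"b": [20, 30]}, index=[102, 103])

# join(on="key") looks up left["key"] in right's index ...
via_join = left.join(right, on="key", how="inner")

# ... which is the same lookup as this explicit merge call.
via_merge = pd.merge(left, right, left_on="key", right_index=True, how="inner")

print(via_join)
print(via_merge)
```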
Moreover, the merge is empty when called this way:
In [228]: pd.merge(df_sum, emp, left_on='X', left_index=False, right_index=True)
Out[228]:
Empty DataFrame
Columns: [X, X_x, Y, X_y, Z]
Index: []
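because left_on='X' together with right_index=True compares the values of df_sum's X column with emp's index labels ('A' to 'F'), which never match. left_on='X' needs to be on='X' for the merge to succeed: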
In [233]: pd.merge(df_sum, emp, on='X', left_index=False, right_index=True)
Out[233]:
X Y Z
A 0 1 1
B 2 3 3
C 4 5 5
D 6 7 7
E 8 9 9
F 10 11 11
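Applied to the frames in the question, this suggests two possible fixes, sketched below on a tiny stand-in for the real data. The three-row df_sum and two-row emp are hypothetical reductions that reuse a few TUCASEID values from the question; the real frames are assumed to behave the same way.

```python
import pandas as pd

# Tiny stand-ins for the question's frames (values borrowed from its head() output).
df_sum = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013344, 20030100013352],
    "shopping": [0, 0, 60],
})
emp = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013352],
    "status": ["emp", "emp"],
})

# Reproduces the problem: df_sum.TUCASEID is compared with emp's index (0, 1),
# not with emp.TUCASEID, so no row matches.
empty = df_sum.join(emp, on="TUCASEID", how="inner", rsuffix="_r")

# Fix 1: put the key into emp's index so join(on=...) has something to match.
fixed_join = df_sum.join(emp.set_index("TUCASEID"), on="TUCASEID", how="inner")

# Fix 2: merge column against column.
fixed_merge = pd.merge(df_sum, emp, on="TUCASEID", how="inner")

print(len(empty), len(fixed_join), len(fixed_merge))
```

Either fix avoids the need for rsuffix, because TUCASEID no longer appears as a separate column on both sides of the join.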