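Pandas join: does not recognize the join column

Time: 2015-01-31 22:18:13

Tags: python join pandas inner-join

I don't know what is going on; the title is only a first-order approximation. I am trying to join two data frames: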

>>> df_sum.head()
         TUCASEID  t070101  t070102  t070103  t070104  t070105  t070199  \
0  20030100013280        0        0        0        0        0        0   
1  20030100013344        0        0        0        0        0        0   
2  20030100013352       60        0        0        0        0        0   
3  20030100013848        0        0        0        0        0        0   
4  20030100014165        0        0        0        0        0        0   

   t070201  t070299  shopping  year  
0        0        0         0  2003  
1        0        0         0  2003  
2        0        0        60  2003  
3        0        0         0  2003  
4        0        0         0  2003  
>>> emp.head()
         TUCASEID status
0  20030100013280    emp
1  20030100013344    emp
2  20030100013352    emp
4  20030100014165    emp
5  20030100014169    emp

These are the data frames, and I want to join them on the common column TUCASEID, on which they do intersect:

>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID)
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462,
       20131212132469, 20131212132475])

Now...

>>> df_sum.join(emp, on='TUCASEID', how='inner')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join
    rsuffix=rsuffix, sort=sort)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat
    suffixes=(lsuffix, rsuffix), sort=sort)
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result
    rdata.items, rsuf)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix
    to_rename)
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object')

Hmm, that's odd: the only column that appears in both data frames is the one to join on. But fine, let's go along with it [1]:

>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r')
Empty DataFrame
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status]
Index: []

The result is empty even though there is a huge intersection. What is going on here?

>>> pd.__version__
'0.15.0'

[1]: I actually enforced an integer dtype on the join column, because the error message says "object" there. It made no difference:

>>> emp.dtypes
TUCASEID     int64
status      object
dtype: object
>>> df_sum.dtypes
TUCASEID    int64
(...)
shopping    int64
year        int64
dtype: object

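1 answer:

Answer 0 (score: 2)

df.join usually calls pd.merge (except in a special case where it calls concat). So anything join can do, merge can do too. Although it may not be strictly correct, I tend to use df.join only when joining on the index and pd.merge when joining on columns.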
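To make that rule of thumb concrete, here is a minimal sketch. The frames `left` and `right` and their values are hypothetical and were made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical frames: same index labels, but 'key' values in opposite order
left = pd.DataFrame({'key': [1, 2, 3], 'a': [10, 20, 30]}, index=list('ABC'))
right = pd.DataFrame({'key': [3, 2, 1], 'b': ['x', 'y', 'z']}, index=list('ABC'))

# Index-on-index: df.join lines rows up by index label (A, B, C).
# rsuffix is needed because both frames have a 'key' column.
by_index = left.join(right, rsuffix='_r')

# Column-on-column: pd.merge lines rows up by equal values in 'key'.
by_column = pd.merge(left, right, on='key')

print(by_index)
print(by_column)
```

The two results pair the rows differently: `by_index` matches row A with row A, while `by_column` matches key 1 on the left with key 1 on the right.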

So, I can reproduce the problem you describe:

import numpy as np
import pandas as pd

df_sum = pd.DataFrame(np.arange(6*2).reshape((6,2)), 
                      index=list('ABCDEF'), columns=list('XY'))
emp =  pd.DataFrame(np.arange(6*2).reshape((6,2)), 
                    index=list('ABCDEF'), columns=list('XZ'))
print(df_sum.join(emp, on='X', rsuffix='_r', how='inner'))

# Empty DataFrame
# Columns: [X, Y, X_r, Z]
# Index: []

pd.merge works as expected, with no need to supply rsuffix:

print(pd.merge(df_sum, emp, on='X'))

yields

    X   Y   Z
0   0   1   1
1   2   3   3
2   4   5   5
3   6   7   7
4   8   9   9
5  10  11  11

Under the hood, df_sum.join calls merge this way:

    if isinstance(other, DataFrame):
        return merge(self, other, left_on=on, how=how,
                     left_index=on is None, right_index=True,
                     suffixes=(lsuffix, rsuffix), sort=sort)

So even when you use df_sum.join(emp, on='...'), Pandas translates it into pd.merge(df_sum, emp, left_on='...', right_index=True). And merge returns an empty result when called this way:

In [228]: pd.merge(df_sum, emp, left_on='X', left_index=False, right_index=True)
Out[228]: 
Empty DataFrame
Columns: [X, X_x, Y, X_y, Z]
Index: []
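The empty result follows from right_index=True: the integer values in df_sum['X'] are compared against emp's index labels ('A' to 'F'), and none of them match. A minimal sketch, assuming the same toy frames as above, of one way to make df.join work anyway by moving emp's X column into its index first. This workaround is not part of the original answer, and `joined` is an illustrative name.

```python
import numpy as np
import pandas as pd

df_sum = pd.DataFrame(np.arange(6 * 2).reshape((6, 2)),
                      index=list('ABCDEF'), columns=list('XY'))
emp = pd.DataFrame(np.arange(6 * 2).reshape((6, 2)),
                   index=list('ABCDEF'), columns=list('XZ'))

# join(on='X') looks up df_sum['X'] in the *index* of the other frame,
# so make X the index of emp before joining.
joined = df_sum.join(emp.set_index('X'), on='X', how='inner')
print(joined)
```

No rsuffix is needed here, because after set_index('X') the right-hand frame no longer has an X column that could collide.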

For the merge to succeed, left_on='X' needs to be on='X' instead:

In [233]: pd.merge(df_sum, emp, on='X', left_index=False, right_index=True)
Out[233]: 
    X   Y   Z
A   0   1   1
B   2   3   3
C   4   5   5
D   6   7   7
E   8   9   9
F  10  11  11
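Applied to the question's data, the same reasoning might look like the sketch below. The rows are copied from the head() output in the question and trimmed to a few columns for brevity; `merged` and `empty` are illustrative names.

```python
import pandas as pd

# Rows taken from the question's head() output (only three columns kept)
df_sum = pd.DataFrame({
    'TUCASEID': [20030100013280, 20030100013344, 20030100013352,
                 20030100013848, 20030100014165],
    'shopping': [0, 0, 60, 0, 0],
    'year': [2003] * 5,
})
emp = pd.DataFrame({
    'TUCASEID': [20030100013280, 20030100013344, 20030100013352,
                 20030100014165, 20030100014169],
    'status': ['emp'] * 5,
}, index=[0, 1, 2, 4, 5])

# join(on=...) compares df_sum['TUCASEID'] with emp's index (0, 1, 2, 4, 5),
# so nothing matches and the result is empty.
empty = df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r')

# merge(on=...) compares the TUCASEID columns of both frames with each other.
merged = pd.merge(df_sum, emp, on='TUCASEID', how='inner')
print(merged)
```

With these sample rows, `merged` keeps only the case IDs present in both frames.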