合并两个不同形状的pandas数据帧时出现问题?

时间:2016-11-17 17:24:24

标签: python pandas join merge

这很简单,但我不明白为什么我无法合并两个数据帧。我有以下df个不同形状(一个比另一个更大更宽):

DF1

    A                      id
0   microsoft inc          1
1   apple computer.        2
2   Google Inc.            3
3   IBM                    4
4   amazon, Inc.           5

DF2

    B   C   D   E   id
0   (01780-500-01)  237489 - 342    API     True.   1
0   (409-6043-01)   234324  API     Other   2
0   23423423    API     NaN     NaN     3
0   (001722-5e240-60)   NaN     NaN     Other   4
1   (0012172-52411-60)  32423423.   NaN     Other   4
0   29849032-29482390   API     Yes     False   5
1   329482030-23490-1   API     Yes     False   5

我想在df1列合并df2index

DF3

    A   B   C   D   E   id
0   microsoft inc   (01780-500-01)  237489 - 342    API     True.   1
1   apple computer. (409-6043-01)   234324  API     Other   2
2   Google Inc. 23423423    API     NaN     NaN     3
3   IBM (001722-5e240-60)   NaN     NaN     Other   4
4   IBM (0012172-52411-60)  32423423.   NaN     Other   4
5   amazon, Inc.    29849032-29482390   API     Yes     False   5
6   amazon, Inc.    329482030-23490-1   API     Yes     False   5

我知道可以使用merge()完成此操作。另外,我读了这篇优秀的tutorial并试图:

在:

pd.merge(df1, df2, on=df1.id, how='outer')

输出:

IndexError: indices are out-of-bounds

然后我尝试了:

pd.merge(df2, df1, on='id', how='outer')

显然它重复了几次合并的行,如下所示:

    A   B   C   D   E   index
0   microsoft inc   (01780-500-01)  237489 - 342    API     True.   1
1   apple computer. (409-6043-01)   234324  API     Other   2
2   apple computer. (409-6043-01)   234324  API     Other   2
3   apple computer. (409-6043-01)   234324  API     Other   2
4   apple computer. (409-6043-01)   234324  API     Other   2
5   apple computer. (409-6043-01)   234324  API     Other   2
6   apple computer. (409-6043-01)   234324  API     Other   2
7   apple computer. (409-6043-01)   234324  API     Other   2
8   apple computer. (409-6043-01)   234324  API     Other   2
...

我认为这与我创建时间索引df2['position'] = df2.index的事实有关,因为索引看起来很奇怪,然后将其删除。所以,我的问题是如何获得df3

更新

我修改了df2这样的索引:

df2.reset_index(drop=True, inplace=True)

现在看起来像这样:

    B   C   D   E   id
0   (01780-500-01)  237489 - 342    API     True.   1
1   (409-6043-01)   234324  API     Other   2
2   23423423    API     NaN     NaN     3
3   (001722-5e240-60)   NaN     NaN     Other   4
4   (0012172-52411-60)  32423423.   NaN     Other   4
5   29849032-29482390   API     Yes     False   5
6   329482030-23490-1   API     Yes     False   5

我仍然遇到同样的问题。合并的行重复几次。

>>>print(df2.dtypes)
B    object
C    object
D    object
E    object
id   int64
dtype: object

>>>print(df1.dtypes)
A                object
id               int64
dtype: object

UPDATE2

>>>print(df2['id'])
0        1
1        2
2        3
3        4
4        4
5        5
6        5
7        6
8        6
9        7
10       8
11       8
12       8
13       8
14       9
15      10
16      11
17      11
18      12
19      12
20      13
21      13
22      14
23      15
24      16
25      16
26      17
27      17
28      18
29      18
      ... 
476    132
477    132
478    132
479    132
480    132
481    132
482    132
483    132
484    133
485    133
486    133
487    133
488    134
489    134
490    134
491    134
492    135
493    135
494    136
495    136
496    137
497    137
498    137
499    137
500    137
501    137
502    137
503    138
504    138
505    138
Name: id, dtype: int64

>>>print(df1)

0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10
10     11
11      8
12     12
13      6
14      7
15      8
16      6
17     11
18     13
19     14
20     15
21     11
22      2
23     16
24     17
25     18
26      9
27     19
28     11
29     20
       ..
108    57
109    43
110    22
111     2
112    58
113    49
114    22
115    59
116     2
117     6
118    22
119     2
120    37
121     2
122     9
123    60
124    61
125    62
126    63
127    42
128    64
129     4
130    29
131    11
132     2
133    25
134     4
135    65
136    66
137     4
Name: id, dtype: int64

1 个答案:

答案 0 :(得分:1)

您可以尝试将索引设置为id,然后使用join

df1 = pd.DataFrame([('microsoft inc',1),
('apple computer.',2),
('Google Inc.',3),
('IBM',4),
('amazon, Inc.',5)],columns = ('A','id'))

df2 = pd.DataFrame([('(01780-500-01)','237489', '- 342','API',   1),
('(409-6043-01)','234324', ' API','Other   ',2),
('23423423','API', 'NaN','NaN',     3),
('(001722-5e240-60)','NaN', 'NaN','Other',   4),
('(0012172-52411-60)','32423423','   NaN','Other',   4),
('29849032-29482390','API', '    Yes','     False',   5),
('329482030-23490-1','API', '    Yes','     False',   5)],
columns = ['B','C','D','E','id'])

df1  =df1.set_index('id')
df1.drop_duplicates(inplace=True)
df2  = df2.set_index('id')
df3  = df1.join(df2,how='outer')

由于您已为两个数据帧设置索引列(也称为连接键),因此您不必指定on='id'参数。

这是解决问题的另一种方法。我认为pd.merge(df1, df2, on='id', how='outer')没有任何问题。您可能需要仔细检查两个数据框中的id列,如@JohnE所述