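Why does merging two DataFrames on a common column produce an empty result?

Time: 2017-05-09 22:30:36

Tags: python pandas

I am working with survey data in two DataFrame objects, but I cannot get them to merge correctly. The structures look like this: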

In [93]: numeric_answers
Out[93]: 
   ANSWER_COUNT RESPONSE
1            50        1
2            21        2
4             3        4


In [94]: readable_values
Out[94]: 
                                                    MEANING
RESPONSE                                                   
 1                                                     male
 2                                                   female
 3                                              transgender
 5        non-binary, genderqueer, or gender non-conforming
 6                    a different identity (please specify)
 4                                   prefer not to disclose
-9                                             Not answered

My goals are:

  • Merge them on RESPONSE
  • Produce a DataFrame with the columns ['RESPONSE', 'MEANING', 'ANSWER_COUNT']
  • Set missing values to N/A (although 0 would also be fine)

Example of the desired output:

RESPONSE                                        MEANING  ANSWER_COUNT
   1                                               male           50
   2                                             female           21
   3                                        transgender           NaN
   5  non-binary, genderqueer, or gender non-conforming           NaN
   6              a different identity (please specify)           NaN
   4                             prefer not to disclose           3
  -9                                       Not answered           NaN

After reading the merge documentation, I concluded that what I needed was pd.merge(readable_values, numeric_answers), but this produces an empty result:

Empty DataFrame
Columns: [RESPONSE, MEANING, ANSWER_COUNT]
Index: []

After trying various arguments, merge(readable_values, numeric_answers, on='RESPONSE', how='outer') gave a promising result:

(Pdb) pd.merge(readable_values, numeric_answers, on='RESPONSE', how='outer')
   RESPONSE                                            MEANING  ANSWER_COUNT
0       1.0                                               male           NaN
1       2.0                                             female           NaN
2       3.0                                        transgender           NaN
3       5.0  non-binary, genderqueer, or gender non-conforming           NaN
4       6.0              a different identity (please specify)           NaN
5       4.0                             prefer not to disclose           NaN
6      -9.0                                       Not answered           NaN
7       1.0                                                NaN          50.0
8       2.0                                                NaN          21.0
9       4.0                                                NaN           3.0

However, it merges by appending the rows, whereas I need the entries matched up on RESPONSE. What is the idiomatic pandas way to achieve this?

1 answer:

Answer 0 (score: 3)

readable_values has RESPONSE as its index, not as a column. You can do the merge like this:

In [11]: numeric_answers.merge(readable_values, left_on='RESPONSE', right_index=True, how='outer')
Out[11]:
   ANSWER_COUNT  RESPONSE                                            MEANING
1          50.0         1                                               male
2          21.0         2                                             female
4           3.0         4                             prefer not to disclose
4           NaN         3                                        transgender
4           NaN         5  non-binary, genderqueer, or gender non-conforming
4           NaN         6              a different identity (please specify)
4           NaN        -9                                       Not answered
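
To try this yourself, here is a minimal sketch that rebuilds the two frames and merges the RESPONSE column against the RESPONSE index. The construction code is an assumption, since the question only shows the printed frames, and the row order and index labels of the result may differ between pandas versions.

```python
import pandas as pd

# Assumed reconstruction of the frames printed in the question.
numeric_answers = pd.DataFrame(
    {"ANSWER_COUNT": [50, 21, 3], "RESPONSE": [1, 2, 4]},
    index=[1, 2, 4],
)
readable_values = pd.DataFrame(
    {
        "MEANING": [
            "male",
            "female",
            "transgender",
            "non-binary, genderqueer, or gender non-conforming",
            "a different identity (please specify)",
            "prefer not to disclose",
            "Not answered",
        ]
    },
    # RESPONSE is the index here, not a column.
    index=pd.Index([1, 2, 3, 5, 6, 4, -9], name="RESPONSE"),
)

# Match the RESPONSE column on the left against the index on the right.
merged = numeric_answers.merge(
    readable_values, left_on="RESPONSE", right_index=True, how="outer"
)
print(merged)
```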

Another option is to call reset_index on readable_values first:

In [12]: numeric_answers.merge(readable_values.reset_index(), on='RESPONSE', how='outer')
Out[12]:
   ANSWER_COUNT  RESPONSE                                            MEANING
0          50.0         1                                               male
1          21.0         2                                             female
2           3.0         4                             prefer not to disclose
3           NaN         3                                        transgender
4           NaN         5  non-binary, genderqueer, or gender non-conforming
5           NaN         6              a different identity (please specify)
6           NaN        -9                                       Not answered
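
To get the exact layout asked for in the question, one possible variation (a sketch, not part of the original answer) is to put readable_values on the left and use how='left', which keeps its row order, then select the columns in the desired order. This assumes every RESPONSE in numeric_answers also appears in readable_values, as it does in the data shown; the shortened MEANING labels and the construction code are placeholders.

```python
import pandas as pd

# Assumed reconstruction of the question's frames (MEANING labels shortened).
numeric_answers = pd.DataFrame(
    {"ANSWER_COUNT": [50, 21, 3], "RESPONSE": [1, 2, 4]},
    index=[1, 2, 4],
)
readable_values = pd.DataFrame(
    {"MEANING": ["male", "female", "transgender", "non-binary",
                 "a different identity", "prefer not to disclose",
                 "Not answered"]},
    index=pd.Index([1, 2, 3, 5, 6, 4, -9], name="RESPONSE"),
)

# Turn the RESPONSE index into a column, then left-merge so that the
# row order of readable_values is preserved.
result = readable_values.reset_index().merge(
    numeric_answers, on="RESPONSE", how="left"
)[["RESPONSE", "MEANING", "ANSWER_COUNT"]]

# Optional: use 0 instead of NaN for responses nobody chose.
filled = result.fillna({"ANSWER_COUNT": 0})
print(result)
print(filled)
```

Leaving the NaN values in place keeps "no data" distinguishable from a real count of zero; fillna is only worth applying if 0 is the more useful value downstream.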

Note the difference you can see in how the two are rendered:

In [21]: readable_values
Out[21]:
                                                    MEANING
RESPONSE
 1                                                     male
 2                                                   female
 3                                              transgender
 5        non-binary, genderqueer, or gender non-conforming
 6                    a different identity (please specify)
 4                                   prefer not to disclose
-9                                             Not answered

In [22]: readable_values.reset_index()  # RESPONSE is now a column
Out[22]:
   RESPONSE                                            MEANING
0         1                                               male
1         2                                             female
2         3                                        transgender
3         5  non-binary, genderqueer, or gender non-conforming
4         6              a different identity (please specify)
5         4                             prefer not to disclose
6        -9                                       Not answered
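
If the printed output is ambiguous, you can also check programmatically whether a key is a column or an index level. A small sketch, using a two-row stand-in for readable_values (the frame contents here are placeholders):

```python
import pandas as pd

# Two-row stand-in for readable_values, with RESPONSE as the index name.
readable_values = pd.DataFrame(
    {"MEANING": ["male", "female"]},
    index=pd.Index([1, 2], name="RESPONSE"),
)

in_columns = "RESPONSE" in readable_values.columns      # False
in_index = "RESPONSE" in readable_values.index.names    # True

# After reset_index, RESPONSE becomes an ordinary column.
flat = readable_values.reset_index()
in_flat_columns = "RESPONSE" in flat.columns            # True

print(in_columns, in_index, in_flat_columns)
```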