Only outer join in Python pandas

Time: 2017-11-07 20:25:45

Tags: python pandas dataframe outer-join

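I have two DataFrames with the same column names. Some of their rows match and some are unique to one DataFrame.

I want to exclude the matching rows in the middle and keep only the rows that are unique to each DataFrame.

How should I concat, merge, or join these two DataFrames to get that?

For example, in this picture I do not want the middle part; I want both sides but not the middle:

[image]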

My code currently looks like this:

def query_to_df(query):
    ...
    df_a = pd.DataFrame(data_a)
    df_b = pd.DataFrame(data_b)
    outer_results = pd.concat([df_a, df_b], axis=1, join='outer')
    return df

Let me give an example of what I need:

df_a = 
col_a  col_b  col_c
   a1     b1     c1
   a2     b2     c2

df_b = 
col_a  col_b  col_c
   a2     b2     c2
   a3     b3     c3

# they only share the 2nd row:    a2     b2     c2 
# so the outer result should be:
col_a  col_b  col_c  col_a  col_b  col_c
   a1     b1     c1     NA     NA     NA
   NA     NA     NA     a3     b3     c3

Or I would be just as happy with two DataFrames:

result_1 = 
col_a  col_b  col_c
   a1     b1     c1

result_2 = 
col_a  col_b  col_c
   a3     b3     c3

Finally, you will notice that a2 b2 c2 was excluded because all of its columns match. How do I specify that I want to join on all columns, not just one? If df_a had contained a2 foo c2, I would want that row to be in the result as well.

3 answers:

Answer 0: (score: 4)

Use merge with the indicator parameter and an outer join first, then filter with query or boolean indexing:

df = df_a.merge(df_b, how='outer', indicator=True)
print (df)
  col_a col_b col_c      _merge
0    a1    b1    c1   left_only
1    a2    b2    c2        both
2    a3    b3    c3  right_only

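# rows found only in df_a; drop the helper column by name
# (the positional axis argument, drop('_merge', 1), no longer works in pandas 2.0+)
a = df.query('_merge == "left_only"').drop(columns='_merge')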
print (a)
  col_a col_b col_c
0    a1    b1    c1

# rows found only in df_b
b = df.query('_merge == "right_only"').drop(columns='_merge')
print (b)
  col_a col_b col_c
2    a3    b3    c3

Or:

a = df[df['_merge'] == "left_only"].drop(columns='_merge')
print (a)
  col_a col_b col_c
0    a1    b1    c1

b = df[df['_merge'] == "right_only"].drop(columns='_merge')
print (b)
  col_a col_b col_c
2    a3    b3    c3
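
As a self-contained sketch of the merge-with-indicator approach, the snippet below wraps both filters in one helper. The function name split_unmatched and the extra a2 foo c2 row are illustrative assumptions, added to show that a row matching on only some columns is still treated as unmatched, because merge joins on all shared columns when no on argument is given.

```python
import pandas as pd


def split_unmatched(left, right):
    """Return (rows only in left, rows only in right), comparing all shared columns."""
    # With no `on` argument, merge joins on every column name the frames share.
    merged = left.merge(right, how="outer", indicator=True)
    only_left = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    only_right = merged[merged["_merge"] == "right_only"].drop(columns="_merge")
    return only_left.reset_index(drop=True), only_right.reset_index(drop=True)


# Example frames from the question, plus a hypothetical partial match (a2, foo, c2).
df_a = pd.DataFrame({"col_a": ["a1", "a2", "a2"],
                     "col_b": ["b1", "b2", "foo"],
                     "col_c": ["c1", "c2", "c2"]})
df_b = pd.DataFrame({"col_a": ["a2", "a3"],
                     "col_b": ["b2", "b3"],
                     "col_c": ["c2", "c3"]})

only_a, only_b = split_unmatched(df_a, df_b)
print(only_a)  # the a1 row and the a2/foo/c2 row
print(only_b)  # the a3 row
```

Returning two frames matches the result_1 / result_2 form the question asks for; if one combined frame is preferred, filtering on _merge != "both" should work the same way.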

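Answer 1: (score: 4)

Use pd.DataFrame.drop_duplicates.
This assumes that rows are unique within their respective DataFrames.

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
pd.concat([df_a, df_b]).drop_duplicates(keep=False)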

  col_a col_b col_c
0    a1    b1    c1
1    a3    b3    c3
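
To make the uniqueness assumption concrete, here is a small sketch with a hypothetical df_a in which the a1 row appears twice. Because keep=False drops every copy of a repeated row, the a1 row disappears even though df_b never contains it. De-duplicating each frame first is one possible workaround, not the only one.

```python
import pandas as pd

# Hypothetical variant of df_a: the a1 row is repeated within the same frame.
df_a = pd.DataFrame({"col_a": ["a1", "a1", "a2"],
                     "col_b": ["b1", "b1", "b2"],
                     "col_c": ["c1", "c1", "c2"]})
df_b = pd.DataFrame({"col_a": ["a2", "a3"],
                     "col_b": ["b2", "b3"],
                     "col_c": ["c2", "c3"]})

# Naive version: the repeated a1 row counts as a duplicate and is removed too.
naive = pd.concat([df_a, df_b]).drop_duplicates(keep=False)

# Workaround: make rows unique inside each frame before comparing across frames.
deduped = pd.concat(
    [df_a.drop_duplicates(), df_b.drop_duplicates()]
).drop_duplicates(keep=False)

print(naive)    # only the a3 row survives
print(deduped)  # the a1 row and the a3 row
```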

You can even use the keys parameter of pd.concat to record which DataFrame each row came from.

pd.concat([df_a, df_b], keys=['a', 'b']).drop_duplicates(keep=False)

    col_a col_b col_c
a 0    a1    b1    c1
b 1    a3    b3    c3
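
If two separate frames are wanted, the keyed result can be split on the outer index level. This is a minimal sketch using the example data from the question; the names unmatched, result_1, and result_2 are chosen here for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({"col_a": ["a1", "a2"],
                     "col_b": ["b1", "b2"],
                     "col_c": ["c1", "c2"]})
df_b = pd.DataFrame({"col_a": ["a2", "a3"],
                     "col_b": ["b2", "b3"],
                     "col_c": ["c2", "c3"]})

# The keys become the outer level of a MultiIndex on the stacked rows.
unmatched = pd.concat([df_a, df_b], keys=["a", "b"]).drop_duplicates(keep=False)

# Selecting an outer key returns that frame's surviving rows with their original index.
# Note: .loc raises KeyError if a key has no surviving rows at all.
result_1 = unmatched.loc["a"]
result_2 = unmatched.loc["b"]

print(result_1)
print(result_2)
```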

Answer 2: (score: 1)

concat and drop_duplicates with keep=False

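new_df = pd.concat([df_a, df_b]).drop_duplicates(keep=False)

    col_a   col_b   col_c
0   a1      b1      c1
1   a3      b3      c3

Using numpy setdiff1d. Note that it compares the flattened, sorted unique values rather than whole rows, so this only works when no value outside the shared rows appears in both DataFrames:

# assign to new names so the second call still sees the original df_a
only_a = pd.DataFrame(np.setdiff1d(np.array(df_a.values), np.array(df_b.values))\
.reshape(-1, df_a.shape[1]), columns = df_a.columns)

only_b = pd.DataFrame(np.setdiff1d(np.array(df_b.values), np.array(df_a.values))\
.reshape(-1, df_b.shape[1]), columns = df_b.columns)

only_a

    col_a   col_b   col_c
0   a1      b1      c1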
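
Since np.setdiff1d works on individual values rather than rows, a row-wise comparison may be safer when the same value can show up in different rows. The sketch below is one possible alternative, not part of the original answer: the helper name rows_only_in is hypothetical, and it assumes both frames have the same columns in the same order.

```python
import pandas as pd


def rows_only_in(left, right):
    """Rows of `left` whose full tuple of values never appears in `right`."""
    # Turn every row of `right` into a plain tuple for fast membership checks.
    right_rows = set(right.itertuples(index=False, name=None))
    keep = [row not in right_rows
            for row in left.itertuples(index=False, name=None)]
    return left.loc[keep].reset_index(drop=True)


# Hypothetical data: b1 appears in both frames, but in different rows.
df_a = pd.DataFrame({"col_a": ["a1", "a2"],
                     "col_b": ["b1", "b2"],
                     "col_c": ["c1", "c2"]})
df_b = pd.DataFrame({"col_a": ["a2", "a3"],
                     "col_b": ["b2", "b1"],
                     "col_c": ["c2", "c3"]})

only_a = rows_only_in(df_a, df_b)
only_b = rows_only_in(df_b, df_a)
print(only_a)
print(only_b)
```

On this data the value-wise difference of df_b against df_a would leave just a3 and c3, which cannot be reshaped into three columns, while the row-wise version keeps the whole a3 b1 c3 row.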