How to in Spark

Time: 2017-12-01 20:06:59

Tags: apache-spark apache-spark-sql spark-dataframe lookup apache-spark-dataset

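I have three DataFrames in Spark and want to pull values from one DataFrame into another based on certain conditions. My scenario is below. Can someone help me?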

DF1:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border-color:#aaa;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#aaa;color:#333;background-color:#fff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#aaa;color:#fff;background-color:#f38630;}
.tg .tg-j2zy{background-color:#FCFBE3;vertical-align:top}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-yw4l{vertical-align:top}
.tg .tg-yq6s{background-color:#FCFBE3;text-align:center;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-baqh">person_id</th>
    <th class="tg-yw4l">criterion_name_1</th>
    <th class="tg-yw4l">criterion_id_1</th>
    <th class="tg-yw4l">criterion_name_2</th>
    <th class="tg-yw4l">criterion_id_2</th>
    <th class="tg-baqh">criterion_name_3</th>
    <th class="tg-yw4l">criterion_id_3</th>
    <th class="tg-yw4l">criterion_name_4</th>
    <th class="tg-yw4l">criterion_id_4</th>
    <th class="tg-yw4l">criterion_name_5</th>
    <th class="tg-yw4l">criterion_id_5</th>
  </tr>
  <tr>
    <td class="tg-yq6s">100</td>
    <td class="tg-j2zy">Condition</td>
    <td class="tg-j2zy">A-363-3015</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-yq6s">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
  </tr>
  <tr>
    <td class="tg-baqh">101</td>
    <td class="tg-yw4l">Condition</td>
    <td class="tg-yw4l">D-229-3007</td>
    <td class="tg-yw4l">Condition</td>
    <td class="tg-yw4l">A-229-3008</td>
    <td class="tg-baqh">Condition</td>
    <td class="tg-yw4l">D-229-3008</td>
    <td class="tg-yw4l">Condition</td>
    <td class="tg-yw4l">A-229-3009</td>
    <td class="tg-yw4l">Condition</td>
    <td class="tg-yw4l">D-229-3009</td>
  </tr>
  <tr>
    <td class="tg-yq6s">102</td>
    <td class="tg-j2zy">Condition</td>
    <td class="tg-j2zy">A-229-3012</td>
    <td class="tg-j2zy">Observation</td>
    <td class="tg-j2zy">PZXC</td>
    <td class="tg-yq6s">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
    <td class="tg-j2zy">null</td>
  </tr>
</table>

Besides this DF, I have two lookup DataFrames: 1. the Condition DF and 2. the Observation DF.

  1. Condition DF:

    +-----+--------------+------+
    | id  | condition_id | code |
    +-----+--------------+------+
    | 100 | A-363-3015   | xyz  |
    +-----+--------------+------+
    | 101 | A-334-3015   | pqr  |
    +-----+--------------+------+
    
  2. Observation DF:

    +-----+----------------+------+
    | id  | observation_id | code |
    +-----+----------------+------+
    | 100 | PZXC           | 123  |
    +-----+----------------+------+
    | 101 | P2WZX          | pw32 |
    +-----+----------------+------+
    

I want the final DF to have the following structure, with the criterion_value columns populated from the lookup DFs.

    |person_id|criterion_name_1|criterion_id_1|criterion_value_1|criterion_name_2|criterion_id_2|criterion_value_2|criterion_name_3|criterion_id_3|criterion_value_3|criterion_name_4|criterion_id_4|criterion_value_4|criterion_name_5|criterion_id_5|criterion_value_5|
    

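In the DF structure above, the columns criterion_value_1, criterion_value_2, criterion_value_3 ... criterion_value_5 are populated as follows.

If criterion_name_1 = Condition, look up the Condition DF: match the value of criterion_id_1 against the condition_code column of the Condition DF (shown as condition_id in the table above) and put the value of its code column into criterion_value_1. Do the same for every corresponding criterion_name, up to 5.

The same applies when criterion_name_1 = Observation, using the Observation lookup DF.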
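To pin down the expected result, the rule above can be sketched row by row in plain Python (not Spark). This is only an illustration under a few assumptions that the question does not state: the lookup key is the condition_id / observation_id column, the id column of the lookup DFs is ignored, criterion_name is matched case-insensitively, and an ID with no match yields a null value. The function name add_criterion_values is hypothetical.

```python
# Lookup tables built from the sample Condition DF and Observation DF above:
# key = condition_id / observation_id, value = code.
condition_lookup = {"A-363-3015": "xyz", "A-334-3015": "pqr"}
observation_lookup = {"PZXC": "123", "P2WZX": "pw32"}

# Which lookup table serves which criterion_name
# (case-insensitive matching is an assumption).
lookups = {"condition": condition_lookup, "observation": observation_lookup}


def add_criterion_values(row, slots=5):
    """Return a copy of the row with criterion_value_1..criterion_value_N added."""
    out = {"person_id": row["person_id"]}
    for i in range(1, slots + 1):
        name = row.get(f"criterion_name_{i}")
        cid = row.get(f"criterion_id_{i}")
        # Pick the lookup table by criterion name; empty table if name is null/unknown.
        table = lookups.get(name.lower(), {}) if name else {}
        out[f"criterion_name_{i}"] = name
        out[f"criterion_id_{i}"] = cid
        # None when the ID is null or has no match (assumed behaviour).
        out[f"criterion_value_{i}"] = table.get(cid)
    return out


# Row for person_id 102 from DF1; slots 3 to 5 are null and therefore omitted here.
row_102 = {
    "person_id": 102,
    "criterion_name_1": "Condition",
    "criterion_id_1": "A-229-3012",
    "criterion_name_2": "Observation",
    "criterion_id_2": "PZXC",
}
result = add_criterion_values(row_102)
```

With the sample data, person 102 gets criterion_value_2 = 123 from the Observation DF, while criterion_value_1 stays null because A-229-3012 does not appear in the Condition DF. In Spark, the same per-slot logic would presumably be expressed as left joins against the two lookup DFs, one per criterion slot.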

0 answers:

No answers