PySpark: join on 2 keys and build a list column based on a condition

Date: 2020-09-04 12:01:21

Tags: pyspark

Let's say I have two tables:

+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+
|source_ip  |destination_ip|source_port|destination_port|source_packets|destination_packets|timestampGMT             |
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+
|192.168.1.1|10.0.0.1      |22         |51000           |17            |1                  |2017-03-10T15:27:18+00:00|
|192.168.1.2|10.0.0.2      |51000      |22              |1             |2                  |2017-03-15T12:27:18+00:00|
|192.168.1.2|10.0.0.2      |53         |51000           |2             |3                  |2017-03-15T12:28:18+00:00|
|192.168.1.2|10.0.0.2      |51000      |53              |3             |4                  |2017-03-15T12:29:18+00:00|
|192.168.1.3|10.0.0.3      |80         |51000           |4             |5                  |2017-03-15T12:28:18+00:00|
|192.168.1.3|10.0.0.3      |51000      |80              |5             |6                  |2017-03-15T12:29:18+00:00|
|192.168.1.3|10.0.0.3      |22         |51000           |25            |7                  |2017-03-18T11:27:18+00:00|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+

+-----------+------+
|ip         |label |
+-----------+------+
|192.168.1.1|Router|
|10.0.0.3   |Server|
|1.2.3.4    |Client|
+-----------+------+

How can I efficiently join the two tables so that the labels matching source_ip and destination_ip are stored as a tuple or array (label[0] for source_ip, label[1] for destination_ip)?

+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|source_ip  |destination_ip|source_port|destination_port|source_packets|destination_packets|timestampGMT             |label         |
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|192.168.1.1|10.0.0.1      |22         |51000           |17            |1                  |2017-03-10T15:27:18+00:00|[Router, None]|
|192.168.1.2|10.0.0.2      |51000      |22              |1             |2                  |2017-03-15T12:27:18+00:00|[None, None]  |
|192.168.1.2|10.0.0.2      |53         |51000           |2             |3                  |2017-03-15T12:28:18+00:00|[None, None]  |
|192.168.1.2|10.0.0.2      |51000      |53              |3             |4                  |2017-03-15T12:29:18+00:00|[None, None]  |
|192.168.1.3|10.0.0.3      |80         |51000           |4             |5                  |2017-03-15T12:28:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3      |51000      |80              |5             |6                  |2017-03-15T12:29:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3      |22         |51000           |25            |7                  |2017-03-18T11:27:18+00:00|[None, Server]|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
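
The labelling described above can be pinned down with a small plain-Python reference model, which may be handy for checking a Spark result against the sample rows. This is only a sketch: `labels` and `label_pair` are hypothetical names, and a dict lookup stands in for the actual join.

```python
# Label table from the question, as a plain dict (ip -> label).
labels = {
    "192.168.1.1": "Router",
    "10.0.0.3": "Server",
    "1.2.3.4": "Client",
}

def label_pair(source_ip, destination_ip):
    """Return [source label, destination label]; None where the IP is unlabelled."""
    return [labels.get(source_ip), labels.get(destination_ip)]

# Distinct (source_ip, destination_ip) pairs from the first table.
for src, dst in [("192.168.1.1", "10.0.0.1"),
                 ("192.168.1.2", "10.0.0.2"),
                 ("192.168.1.3", "10.0.0.3")]:
    print(src, dst, label_pair(src, dst))
```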

I only want to return the rows that meet these conditions:

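  • Rows whose label contains Server or Client should be kept
  • [None, None] should be removed (the case where neither source_ip nor destination_ip is defined in the second table)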
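
Read literally, the two bullets amount to a single predicate: keep a row when either label is Server or Client. A rough plain-Python sketch follows, with `WANTED` and `keep_row` as assumed names. It also assumes, based on the expected output in the question, that a row labelled only Router is dropped as well.

```python
# Labels that qualify a row for the result (taken from the first bullet).
WANTED = {"Server", "Client"}

def keep_row(label):
    """label is a [source_label, destination_label] pair; either entry may be None."""
    return any(entry in WANTED for entry in label)

candidates = [["Router", None], [None, None], [None, "Server"]]
kept = [label for label in candidates if keep_row(label)]
print(kept)  # [[None, 'Server']]
```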

The result should look like this:

+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|source_ip  |destination_ip|source_port|destination_port|source_packets|destination_packets|timestampGMT             |label         |
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|192.168.1.3|10.0.0.3      |80         |51000           |4             |5                  |2017-03-15T12:28:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3      |51000      |80              |5             |6                  |2017-03-15T12:29:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3      |22         |51000           |25            |7                  |2017-03-18T11:27:18+00:00|[None, Server]|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+

1 answer:

Answer 0 (score: 0)

Join twice, once on each IP column, then filter and select the columns you need.

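from pyspark.sql.functions import array, col

# Left-join the label table twice: once on source_ip (label1), once on destination_ip (label2).
# Keep rows where either label is Server or Client, then pack both labels into one array column.
df1.join(df2.withColumnRenamed('label', 'label1'), col('source_ip') == col('ip'), 'left').drop('ip') \
   .join(df2.withColumnRenamed('label', 'label2'), col('destination_ip') == col('ip'), 'left').drop('ip') \
   .filter("label1 in ('Server', 'Client') or label2 in ('Server', 'Client')") \
   .withColumn('label', array('label1', 'label2')) \
   .select(*df1.columns, 'label') \
   .show()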

+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------+----------+
|  source_ip|destination_ip|source_port|destination_port|source_packets|destination_packets|       timestampGMT|     label|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------+----------+
|192.168.1.3|      10.0.0.3|         80|           51000|             4|                  5|2017-03-15 12:28:18|[, Server]|
|192.168.1.3|      10.0.0.3|      51000|              80|             5|                  6|2017-03-15 12:29:18|[, Server]|
|192.168.1.3|      10.0.0.3|         22|           51000|            25|                  7|2017-03-18 11:27:18|[, Server]|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------+----------+
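
One detail worth spelling out is why the filter copes with the nulls left by the left joins. In Spark SQL, `NULL IN (...)` evaluates to NULL rather than false, `NULL OR TRUE` is TRUE, and `filter` keeps only rows whose condition is TRUE. The plain-Python sketch below models that three-valued logic with `None` standing in for NULL; `sql_in`, `sql_or` and `passes_filter` are hypothetical helpers for illustration, not Spark functions.

```python
def sql_in(value, options):
    """SQL-style IN for a non-null option list: a NULL input gives NULL (None)."""
    if value is None:
        return None
    return value in options

def sql_or(a, b):
    """SQL-style OR: TRUE wins, then NULL, otherwise FALSE."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def passes_filter(label1, label2):
    wanted = ("Server", "Client")
    condition = sql_or(sql_in(label1, wanted), sql_in(label2, wanted))
    return condition is True  # rows with a FALSE or NULL condition are dropped

print(passes_filter(None, "Server"))   # True
print(passes_filter("Router", None))   # False
print(passes_filter(None, None))       # False
```

Under this model the [Router, None] and [None, None] rows never reach the output, which is consistent with the three rows shown above.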