Let's say I have two tables:
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+
|source_ip |destination_ip|source_port|destination_port|source_packets|destination_packets|timestampGMT |
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+
|192.168.1.1|10.0.0.1 |22 |51000 |17 |1 |2017-03-10T15:27:18+00:00|
|192.168.1.2|10.0.0.2 |51000 |22 |1 |2 |2017-03-15T12:27:18+00:00|
|192.168.1.2|10.0.0.2 |53 |51000 |2 |3 |2017-03-15T12:28:18+00:00|
|192.168.1.2|10.0.0.2 |51000 |53 |3 |4 |2017-03-15T12:29:18+00:00|
|192.168.1.3|10.0.0.3 |80 |51000 |4 |5 |2017-03-15T12:28:18+00:00|
|192.168.1.3|10.0.0.3 |51000 |80 |5 |6 |2017-03-15T12:29:18+00:00|
|192.168.1.3|10.0.0.3 |22 |51000 |25 |7 |2017-03-18T11:27:18+00:00|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+
+-----------+------+
|ip |label |
+-----------+------+
|192.168.1.1|Router|
|10.0.0.3 |Server|
|1.2.3.4 |Client|
+-----------+------+
How can I efficiently join the two tables so that the labels matching source_ip or destination_ip are stored as a tuple or array (label[0] = label for source_ip, label[1] = label for destination_ip)?
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|source_ip |destination_ip|source_port|destination_port|source_packets|destination_packets|timestampGMT |label |
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|192.168.1.1|10.0.0.1 |22 |51000 |17 |1 |2017-03-10T15:27:18+00:00|[Router, None]|
|192.168.1.2|10.0.0.2 |51000 |22 |1 |2 |2017-03-15T12:27:18+00:00|[None, None] |
|192.168.1.2|10.0.0.2 |53 |51000 |2 |3 |2017-03-15T12:28:18+00:00|[None, None] |
|192.168.1.2|10.0.0.2 |51000 |53 |3 |4 |2017-03-15T12:29:18+00:00|[None, None] |
|192.168.1.3|10.0.0.3 |80 |51000 |4 |5 |2017-03-15T12:28:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3 |51000 |80 |5 |6 |2017-03-15T12:29:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3 |22 |51000 |25 |7 |2017-03-18T11:27:18+00:00|[None, Server]|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
I only want to return the rows that match the condition.
The result should look like this:
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|source_ip |destination_ip|source_port|destination_port|source_packets|destination_packets|timestampGMT |label |
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
|192.168.1.3|10.0.0.3 |80 |51000 |4 |5 |2017-03-15T12:28:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3 |51000 |80 |5 |6 |2017-03-15T12:29:18+00:00|[None, Server]|
|192.168.1.3|10.0.0.3 |22 |51000 |25 |7 |2017-03-18T11:27:18+00:00|[None, Server]|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------------+--------------+
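Answer 0 (score: 0)
Join the label table twice, once on source_ip and once on destination_ip, filter on the labels you want, and select your columns.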
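from pyspark.sql.functions import array, col

# df1 is the traffic table, df2 is the ip/label table.
# Left-join the labels twice: label1 matches source_ip, label2 matches destination_ip.
# Keep rows where either label is Server or Client, then pack both labels into one array.
df1.join(df2.withColumnRenamed('label', 'label1'), col('source_ip') == col('ip'), 'left').drop('ip') \
    .join(df2.withColumnRenamed('label', 'label2'), col('destination_ip') == col('ip'), 'left').drop('ip') \
    .filter("label1 in ('Server', 'Client') or label2 in ('Server', 'Client')") \
    .withColumn('label', array('label1', 'label2')) \
    .select(*df1.columns, 'label') \
    .show()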
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------+----------+
| source_ip|destination_ip|source_port|destination_port|source_packets|destination_packets| timestampGMT| label|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------+----------+
|192.168.1.3| 10.0.0.3| 80| 51000| 4| 5|2017-03-15 12:28:18|[, Server]|
|192.168.1.3| 10.0.0.3| 51000| 80| 5| 6|2017-03-15 12:29:18|[, Server]|
|192.168.1.3| 10.0.0.3| 22| 51000| 25| 7|2017-03-18 11:27:18|[, Server]|
+-----------+--------------+-----------+----------------+--------------+-------------------+-------------------+----------+
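If it helps to check the logic without a Spark session, the same idea can be sketched in plain Python. This is only an illustration, not part of the Spark solution: the helper name `label_flows`, the `wanted` parameter, and the trimmed sample rows are assumptions made up for this example. A dictionary lookup with `dict.get` plays the role of each left join, since it returns None when an IP has no label.

```python
# Plain-Python sketch of the two left joins plus the filter (no Spark needed).
# The helper name `label_flows` and the trimmed sample rows are illustrative only.

flows = [
    {"source_ip": "192.168.1.1", "destination_ip": "10.0.0.1", "source_port": 22},
    {"source_ip": "192.168.1.2", "destination_ip": "10.0.0.2", "source_port": 51000},
    {"source_ip": "192.168.1.3", "destination_ip": "10.0.0.3", "source_port": 80},
]
labels = {"192.168.1.1": "Router", "10.0.0.3": "Server", "1.2.3.4": "Client"}


def label_flows(flows, labels, wanted=("Server", "Client")):
    """Attach [source label, destination label] and keep rows with a wanted label."""
    result = []
    for row in flows:
        # dict.get returns None for a missing key, like an unmatched left join
        pair = [labels.get(row["source_ip"]), labels.get(row["destination_ip"])]
        # Same condition as the SQL filter: either label must be in the wanted set
        if any(lbl in wanted for lbl in pair):
            result.append({**row, "label": pair})
    return result


matched = label_flows(flows, labels)
for row in matched:
    print(row["source_ip"], row["destination_ip"], row["label"])
```

As in the Spark version, the Router row is dropped because the filter only accepts Server and Client; passing a different `wanted` tuple changes which rows survive.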