I'm using PySpark to parse a large amount of data. I have a dataframe with the following columns:
ip_address
device_id
location
device_type
I want to create a new column called id and assign the same id value to all rows that meet at least one of the following conditions:
1) they have the same device_id and ip_address
2) they have the same device_id, location and device_type
3) they have the same ip_address, location and device_type
Basically, I want to find all rows that represent the same device according to the conditions above and give them the same id.
So let's say I have the following data:
+--------+-----------+------------+-----------+-------------+
| number | device_id | ip_address | location | device_type |
+--------+-----------+------------+-----------+-------------+
| 1 | device1 | ip1 | location1 | type1 |
| 2 | device1 | ip1 | location1 | type1 |
| 3 | device1 | ip2 | location1 | type1 |
| 4 | device2 | ip1 | location1 | type1 |
| 5 | device3 | ip3 | location2 | type2 |
+--------+-----------+------------+-----------+-------------+
The first 4 rows should be assigned the same id, because each of them is linked to another row by one of the three conditions:
Rows 1 and 2 satisfy condition 1
Rows 2 and 3 satisfy condition 2
Rows 1 and 4 satisfy condition 3
So the output should be:
+--------+-----------+------------+-----------+-------------+----+
| number | device_id | ip_address | location | device_type | id |
+--------+-----------+------------+-----------+-------------+----+
| 1 | device1 | ip1 | location1 | type1 | 1 |
| 2 | device1 | ip1 | location1 | type1 | 1 |
| 3 | device1 | ip2 | location1 | type1 | 1 |
| 4 | device2 | ip1 | location1 | type1 | 1 |
| 5 | device3 | ip3 | location2 | type2 | 2 |
+--------+-----------+------------+-----------+-------------+----+
Is this even possible to achieve? If so, how can I do it?
Answer 0 (score: 2)
You can do it like this. I'm not sure it's the ideal approach, but it works:
from pyspark.sql.functions import col, least, min  # note: this min shadows the builtin

df = spark.createDataFrame([
    ("1", "device1", "ip1", "location1", "type1"),
    ("2", "device1", "ip1", "location1", "type1"),
    ("3", "device1", "ip2", "location1", "type1"),
    ("4", "device2", "ip1", "location1", "type1"),
    ("5", "device3", "ip3", "location2", "type2")
], ("number", "device_id", "ip_address", "location", "device_type"))
# For each condition, compute a candidate id per group: the smallest
# row number among the rows sharing that condition's key columns.
df1 = (df.groupBy("device_id", "ip_address")
    .agg(min(col("number")))
    .select(col("device_id").alias("d_id"),
            col("ip_address").alias("ip"),
            col("min(number)").alias("id1")))

df2 = (df.groupBy("device_id", "location", "device_type")
    .agg(min(col("number")))
    .select(col("device_id").alias("d_id"),
            col("location").alias("l"),
            col("device_type").alias("d_type"),
            col("min(number)").alias("id2")))

df3 = (df.groupBy("ip_address", "location", "device_type")
    .agg(min(col("number")))
    .select(col("ip_address").alias("ip"),
            col("location").alias("l"),
            col("device_type").alias("d_type"),
            col("min(number)").alias("id3")))

# Join each candidate back onto the original rows, then take the
# smallest of the three candidates as the final id.
result = (df
    .join(df1, (df1.d_id == df.device_id) & (df1.ip == df.ip_address), how="inner")
    .select("number", "device_id", "ip_address", "location", "device_type", "id1")
    .join(df2, (df2.d_id == df.device_id) & (df2.l == df.location) & (df2.d_type == df.device_type), how="inner")
    .select("number", "device_id", "ip_address", "location", "device_type", "id1", "id2")
    .join(df3, (df3.ip == df.ip_address) & (df3.l == df.location) & (df3.d_type == df.device_type), how="inner")
    .select("number", "device_id", "ip_address", "location", "device_type", "id1", "id2", "id3")
    .withColumn("id", least(col("id1"), col("id2"), col("id3"))))

result.show()
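If you only want the output shape from the question, one way (a small follow-up using the result variable bound above) is to drop the helper columns:

# Keep only the original columns plus the derived id.
result.select("number", "device_id", "ip_address", "location", "device_type", "id") \
    .orderBy("number") \
    .show()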
The join conditions encode your three matching conditions. The result is in the final id column, which looks like this:
+------+---------+----------+---------+-----------+---+---+---+---+
|number|device_id|ip_address| location|device_type|id1|id2|id3| id|
+------+---------+----------+---------+-----------+---+---+---+---+
|     5|  device3|       ip3|location2|      type2|  5|  5|  5|  5|
|     3|  device1|       ip2|location1|      type1|  3|  1|  3|  1|
|     4|  device2|       ip1|location1|      type1|  4|  4|  1|  1|
|     1|  device1|       ip1|location1|      type1|  1|  1|  1|  1|
|     2|  device1|       ip1|location1|      type1|  1|  1|  1|  1|
+------+---------+----------+---------+-----------+---+---+---+---+
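One caveat: a single pass of least() only links rows that directly share a group, so ids may not fully propagate across longer chains (row A matches B, B matches C, but A and C share no key). The example above happens to converge in one pass. For the general case, here is a minimal iterative sketch (the names KEY_SETS and propagate_ids are my own, not from the answer) that repeatedly takes the per-group minimum id over each key set until a full round changes nothing; a graph library's connected-components routine would be the more principled alternative:

from pyspark.sql import Window
from pyspark.sql import functions as F

# The three key sets, one per matching condition.
KEY_SETS = [
    ["device_id", "ip_address"],
    ["device_id", "location", "device_type"],
    ["ip_address", "location", "device_type"],
]

def propagate_ids(df, max_rounds=20):
    # Ids only ever decrease under min-propagation, so an unchanged
    # sum of ids means a fixed point has been reached.
    for _ in range(max_rounds):
        before = df.agg(F.sum(F.col("id").cast("long"))).first()[0]
        for keys in KEY_SETS:
            df = df.withColumn("id", F.min("id").over(Window.partitionBy(*keys)))
        after = df.agg(F.sum(F.col("id").cast("long"))).first()[0]
        if before == after:
            break
    return df

# Start from the id column produced above.
df_with_ids = propagate_ids(
    result.select("number", "device_id", "ip_address", "location", "device_type", "id"))
df_with_ids.show()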