I have a PySpark dataframe with the following schema:
root
|-- id: integer (nullable = true)
|-- url: string (nullable = true)
|-- cosine_vec: vector (nullable = true)
|-- similar_url: array (nullable = true)
| |-- element: integer (containsNull = true)
similar_url is a column containing an array of integers. Those integers refer to values in the id column.
For example:
+----+--------------------+--------------------+--------------------+
| id| url| vec| similar_url|
+----+--------------------+--------------------+--------------------+
| 26|https://url_26......|[0.81382234943025...|[1724, 911, 1262,...|
+----+--------------------+--------------------+--------------------+
I want to replace each value in similar_url (e.g. 1724) with the url of the row whose id is that value (here, the url of the row with id 1724).
That is just one example; my problem is that I want to do this efficiently for every row.
The output would look like this:
+----+--------------------+--------------------+--------------------+
| id| url| vec| similar_url|
+----+--------------------+--------------------+--------------------+
| 26|https://url_26......|[0.81382234943025...|[https://url_1724...|
+----+--------------------+--------------------+--------------------+
Do you have any ideas?
Answer 0 (score: 1)
I created a small example dataframe based on your description:
from pyspark.sql import functions as F, types as T

df = spark.createDataFrame(
    [
        (1, "url_1", [0.3, 0.6], [2, 3]),
        (2, "url_2", [0.3, 0.5], [1, 3]),
        (3, "url_3", [0.6, 0.5], [1, 2]),
    ],
    ["id", "url", "vec", "similar_url"],
)
df.show()
+---+-----+----------+-----------+
| id| url| vec|similar_url|
+---+-----+----------+-----------+
| 1|url_1|[0.3, 0.6]| [2, 3]|
| 2|url_2|[0.3, 0.5]| [1, 3]|
| 3|url_3|[0.6, 0.5]| [1, 2]|
+---+-----+----------+-----------+
If you are using Spark 2.4 or later, there is a built-in function named arrays_zip that can replace my UDF (a sketch of the built-in call follows the UDF below):
outType = T.ArrayType(
    T.StructType([
        T.StructField("vec", T.FloatType(), True),
        T.StructField("similar_url", T.IntegerType(), True),
    ])
)

@F.udf(outType)
def arrays_zip(vec, similar_url):
    # zip() returns an iterator in Python 3, so materialize it as a list
    return list(zip(vec, similar_url))
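For reference, on Spark 2.4 or later you could skip the UDF entirely and use the built-in function instead; a minimal sketch (the rest of the answer works the same either way):

from pyspark.sql import functions as F

# Built-in equivalent of the UDF above (Spark >= 2.4): produces an array
# of structs whose fields are named after the input columns
# ("vec" and "similar_url").
df_zipped = df.withColumn(
    "zips",
    F.arrays_zip(F.col("vec"), F.col("similar_url"))
)
df_zipped.printSchema()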
Then you can process the data:
df.withColumn(
    "zips",
    arrays_zip(F.col("vec"), F.col("similar_url"))
).withColumn(
    "zip",
    F.explode("zips")
).alias("df").join(
    df.alias("df_2"),
    F.col("df_2.id") == F.col("df.zip.similar_url")
).groupBy("df.id", "df.url").agg(
    F.collect_list("df.zip.vec").alias("vec"),
    F.collect_list("df_2.url").alias("similar_url"),
).show()
+---+-----+----------+--------------+
| id| url| vec| similar_url|
+---+-----+----------+--------------+
| 3|url_3|[0.6, 0.5]|[url_1, url_2]|
| 2|url_2|[0.3, 0.5]|[url_1, url_3]|
| 1|url_1|[0.6, 0.3]|[url_3, url_2]|
+---+-----+----------+--------------+
If you want to preserve the original order, you need to do a bit more work (an alternative without the Python UDFs is sketched after the code below):
@F.udf(T.ArrayType(T.FloatType()))
def get_vec(new_list):
    # sort by pos (the first struct field) to restore the original order
    new_list.sort(key=lambda x: x[0])
    return [x[1] for x in new_list]

@F.udf(T.ArrayType(T.StringType()))
def get_similar_url(new_list):
    new_list.sort(key=lambda x: x[0])
    return [x[2] for x in new_list]
df.withColumn(
    "zips",
    arrays_zip(F.col("vec"), F.col("similar_url"))
).select(
    "id",
    "url",
    F.posexplode("zips")
).alias("df").join(
    df.alias("df_2"),
    F.col("df_2.id") == F.col("df.col.similar_url")
).select(
    "df.id",
    "df.url",
    F.struct(
        F.col("df.pos").alias("pos"),
        F.col("df.col.vec").alias("vec"),
        F.col("df_2.url").alias("similar_url"),
    ).alias("new_struct")
).groupBy(
    "id",
    "url"
).agg(
    F.collect_list("new_struct").alias("new_list")
).select(
    "id",
    "url",
    get_vec(F.col("new_list")).alias("vec"),
    get_similar_url(F.col("new_list")).alias("similar_url"),
).show()
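On Spark 2.4 or later you could also preserve the order without the two Python UDFs: sort the collected structs with sort_array (structs compare field by field, so "pos" comes first) and pull the fields back out with transform. A sketch of that variant, assuming the same df and the built-in arrays_zip shown earlier:

from pyspark.sql import functions as F

# Same pipeline as above, but the order is restored with sort_array and
# the fields are extracted with transform() instead of Python UDFs.
(
    df.withColumn("zips", F.arrays_zip(F.col("vec"), F.col("similar_url")))
    .select("id", "url", F.posexplode("zips"))
    .alias("df")
    .join(df.alias("df_2"), F.col("df_2.id") == F.col("df.col.similar_url"))
    .select(
        "df.id",
        "df.url",
        F.struct(
            F.col("df.pos").alias("pos"),
            F.col("df.col.vec").alias("vec"),
            F.col("df_2.url").alias("similar_url"),
        ).alias("new_struct"),
    )
    .groupBy("id", "url")
    .agg(F.sort_array(F.collect_list("new_struct")).alias("new_list"))
    .select(
        "id",
        "url",
        F.expr("transform(new_list, x -> x.vec)").alias("vec"),
        F.expr("transform(new_list, x -> x.similar_url)").alias("similar_url"),
    )
    .show()
)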