I have a Spark DataFrame in the following format.
df = spark.createDataFrame([(1, 2, 3), (1, 4, 100), (20, 30, 50)],['a', 'b', 'c'])
df.show()
Input:
I want to add a new column "median" holding the median of columns "a", "b" and "c" for each row. How can I do this in PySpark?
Expected output:
I am using Spark 2.3.1.
Answer 0 (score: 3)
Use a udf.
Define a user-defined function, then use withColumn to add the computed column to the DataFrame:
from numpy import median
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
def my_median(a, b, c):
    return int(median([int(a), int(b), int(c)]))
udf_median = udf(my_median, IntegerType())
df_t = df.withColumn('median', udf_median(df['a'], df['b'], df['c']))
df_t.show()
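For exactly three numeric columns, the same result can probably be obtained without a UDF, because the median of three values equals their sum minus the largest and the smallest. The sketch below is an illustration only: `median_of_three` and `with_median3` are hypothetical names that do not come from the answer, while `greatest` and `least` are built-in functions in pyspark.sql.functions. The Spark import sits inside the helper so that the plain-Python check runs without a Spark installation.

```python
def median_of_three(a, b, c):
    """Plain-Python check of the identity: median = sum - max - min."""
    return a + b + c - max(a, b, c) - min(a, b, c)


def with_median3(df, a="a", b="b", c="c"):
    """Hypothetical helper: apply the same identity with Spark built-ins."""
    # Imported here so the module loads even where pyspark is not installed.
    from pyspark.sql.functions import col, greatest, least

    x, y, z = col(a), col(b), col(c)
    # greatest/least skip nulls but "+" propagates them, so a row with
    # any null input ends up with a null median.
    return df.withColumn("median", x + y + z - greatest(x, y, z) - least(x, y, z))


# The three rows from the question.
print([median_of_three(*row) for row in [(1, 2, 3), (1, 4, 100), (20, 30, 50)]])
```

With the DataFrame from the question, `with_median3(df).show()` would be the equivalent call. The identity only holds for three inputs, so this does not generalise to an arbitrary number of columns.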
Answer 1 (score: 2)
There is no built-in function for this, but you can easily compose one from existing components.
# In Spark < 2.4 replace array_sort with sort_array
# Thanks to @RaphaelRoth for pointing that out
from pyspark.sql.functions import array, array_sort, floor, col, size
from pyspark.sql import Column
def percentile(p, *args):
    def col_(c):
        if isinstance(c, Column):
            return c
        elif isinstance(c, str):
            return col(c)
        else:
            raise TypeError("args should be str or Column, got {}".format(type(c)))

    xs = array_sort(array(*[col_(x) for x in args]))
    n = size(xs)
    h = (n - 1) * p
    i = floor(h).cast("int")
    x0, x1 = xs[i], xs[i + 1]
    return x0 + (h - i) * (x1 - x0)
Example usage:
df.withColumn("median", percentile(0.5, *df.columns)).show()
+---+---+---+------+
|  a|  b|  c|median|
+---+---+---+------+
|  1|  2|  3|   2.0|
|  1|  4|100|   4.0|
| 20| 30| 50|  30.0|
+---+---+---+------+
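To make the interpolation in `percentile` easier to follow, here is a plain-Python sketch of the same formula, h = (n - 1) * p, applied to a single row. The name `percentile_plain` is hypothetical, and the clamp on the upper index is an addition of this sketch: in the column version, p = 1 appears to index one position past the end of the array.

```python
import math


def percentile_plain(p, values):
    """Linear-interpolation percentile of one row, mirroring h = (n - 1) * p."""
    xs = sorted(values)
    h = (len(xs) - 1) * p      # fractional position in the sorted values
    i = math.floor(h)          # index of the lower neighbour
    # Clamp the upper index so that p = 1 does not read past the end
    # (an assumption of this sketch, not part of the answer's code).
    j = min(i + 1, len(xs) - 1)
    return xs[i] + (h - i) * (xs[j] - xs[i])


print(percentile_plain(0.5, [1, 4, 100]))
print(percentile_plain(0.5, [1, 2, 3, 4]))
```

With an odd number of values and p = 0.5, h is a whole number and the middle value is returned unchanged. With an even number, the result lies halfway between the two middle values.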
The same percentile function can be written in Scala:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
def percentile(p: Double, args: Column*) = {
  val xs = array_sort(array(args: _*))
  val n = size(xs)
  val h = (n - 1) * p
  val i = floor(h).cast("int")
  val (x0, x1) = (xs(i), xs(i + 1))
  x0 + (h - i) * (x1 - x0)
}
val df = Seq((1, 2, 3), (1, 4, 100), (20, 30, 50)).toDF("a", "b", "c")
df.withColumn("median", percentile(0.5, $"a", $"b", $"c")).show
+---+---+---+------+
|  a|  b|  c|median|
+---+---+---+------+
|  1|  2|  3|   2.0|
|  1|  4|100|   4.0|
| 20| 30| 50|  30.0|
+---+---+---+------+
In Python only, you might also consider a vectorized UDF. It is generally slower than built-in functions, but faster than a non-vectorized udf:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType
import pandas as pd
import numpy as np
def pandas_percentile(p=0.5):
    assert 0 <= p <= 1

    @pandas_udf(DoubleType())
    def _(*args):
        return pd.Series(np.percentile(args, q = p * 100, axis = 0))

    return _
df.withColumn("median", pandas_percentile(0.5)("a", "b", "c")).show()
+---+---+---+------+
|  a|  b|  c|median|
+---+---+---+------+
|  1|  2|  3|   2.0|
|  1|  4|100|   4.0|
| 20| 30| 50|  30.0|
+---+---+---+------+
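The `axis = 0` argument is what turns a per-column input into a per-row result. The vectorized UDF receives one series per column, and reducing over the first axis of the stacked input leaves one value per row. A small NumPy-only sketch of that step, using plain lists in place of the pandas Series and the three rows from the question:

```python
import numpy as np

# One sequence per column, in the order the UDF would receive them.
a = [1, 1, 20]
b = [2, 4, 30]
c = [3, 100, 50]

# Stacking gives shape (3 columns, 3 rows); axis=0 reduces over the
# columns, which leaves one percentile per row.
row_medians = np.percentile([a, b, c], q=50, axis=0)
print(row_medians.tolist())
```

Leaving out `axis` would instead reduce over all values at once and return a single number for the whole batch.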
Answer 2 (score: 1)
I slightly modified OmG's answer so that the UDF works for any number "n" of columns instead of exactly 3.
Code:
df = spark.createDataFrame([(1,2,3),(100,1,10),(30,20,50)],['a','b','c'])
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
def my_median(*args):
    return float(np.median(list(args)))
udf_median = udf(my_median, DoubleType())
df.withColumn('median', udf_median('a','b','c')).show()
Output:
Answer 3 (score: 0)
df = spark.createDataFrame([(1,2,3),(1,4,100),(20,30,50)],['a','b','c'])
from pyspark.sql.functions import col, struct, udf
from pyspark.sql.types import FloatType
import numpy as np
def find_median(values_list):
    try:
        median = np.median(values_list)  # median of the values in each row
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values
median_finder = udf(find_median, FloatType())
df = df.withColumn("List_abc", struct(col('a'), col('b'), col('c')))\
    .withColumn("median", median_finder("List_abc")).drop('List_abc')
df.show()
+---+---+---+------+
|  a|  b|  c|median|
+---+---+---+------+
|  1|  2|  3|   2.0|
|  1|  4|100|   4.0|
| 20| 30| 50|  30.0|
+---+---+---+------+
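With the try/except approach, a row that contains a null would most likely end up in the except branch and get a null median for the whole row. If the intent is instead to ignore nulls and take the median of the remaining values, a function along the following lines could be passed to udf. This is a sketch under that assumption; `median_ignoring_nulls` is a hypothetical name, and it uses the standard-library `statistics.median` rather than NumPy.

```python
import statistics


def median_ignoring_nulls(values):
    """Median of the non-null entries, or None if every entry is null."""
    present = [float(v) for v in values if v is not None]
    if not present:
        return None
    return round(statistics.median(present), 2)


# Plain-Python check; in Spark this would be wrapped as
# udf(median_ignoring_nulls, DoubleType()) and applied to the struct column.
print(median_ignoring_nulls((1, None, 100)))
print(median_ignoring_nulls((20, 30, 50)))
```

Whether skipping nulls is appropriate depends on the data; returning None for incomplete rows, as in the answer, is the more conservative choice.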