Question

I have a DF with columns of different time cycles (1/6, 3/6, 6/6, etc.) and would like to "explode" all the columns to create a new DF in which each row is a 1/6 cycle.

from pyspark import Row 
from pyspark.sql import SparkSession 
from pyspark.sql.functions import explode, arrays_zip, col

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

df = spark.createDataFrame([Row(a=1, b=[1, 2, 3, 4, 5, 6], c=[11, 22, 33], d=['foo'])])

+---+------------------+------------+-----+
|  a|                 b|           c|    d|
+---+------------------+------------+-----+
|  1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|
+---+------------------+------------+-----+

I'm doing the explode like this:

df2 = (df.withColumn("tmp", arrays_zip("b", "c", "d"))
       .withColumn("tmp", explode("tmp"))
       .select("a", col("tmp.b"), col("tmp.c"), "d"))

But the output is not what I want:

+---+---+----+-----+
|  a|  b|   c|    d|
+---+---+----+-----+
|  1|  1|  11|[foo]|
|  1|  2|  22|[foo]|
|  1|  3|  33|[foo]|
|  1|  4|null|[foo]|
|  1|  5|null|[foo]|
|  1|  6|null|[foo]|
+---+---+----+-----+

I want it to look like this:

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  1|  1| 11|foo|
|   |  2|   |   |
|   |  3| 22|   |
|   |  4|   |   |
|   |  5| 33|   |
|   |  6|   |   |
+---+---+---+---+

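I'm new to Spark, and right from the start I've run into the most complicated topics! :)

Update 2019-07-15: Maybe someone has a solution that does not use UDFs? -> answered by @jxc

Update 2019-07-17: Maybe someone has a solution for changing the null <-> value sequence into a more complex order? For example, in c: Null, 11, Null, 22, Null, 33, or a more complex case where in column d we want the first value to be Null, then foo, then Null, Null, Null:

+---+---+---+---+
|  a|  b|  c|  d|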
+---+---+---+---+
|  1|  1|   |   |
|   |  2| 11|foo|
|   |  3|   |   |
|   |  4| 22|   |
|   |  5|   |   |
|   |  6| 33|   |
+---+---+---+---+

Answer 1

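Here is one way to do it without using a udf:

Updated on 2019/07/17: adjusted the SQL stmt and added N = 6 as a parameter to the SQL.

Updated on 2019/07/16: removed the temporary column t and replaced it with the constant array(0,1,2,3,4,5) in the transform function. In this case, we can operate directly on the values of the array elements instead of on their indexes.

Update: I removed the original method, which used String functions and converted all the data types in the array elements to String, and which was less efficient. The Spark SQL higher-order functions available in Spark 2.4+ should be better than the original method.

Setup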

from pyspark.sql import functions as F, Row

df = spark.createDataFrame([ Row(a=1, b=[1, 2, 3, 4, 5, 6], c=['11', '22', '33'], d=['foo'], e=[111,222]) ])

>>> df.show()
+---+------------------+------------+-----+----------+
|  a|                 b|           c|    d|         e|
+---+------------------+------------+-----+----------+
|  1|[1, 2, 3, 4, 5, 6]|[11, 22, 33]|[foo]|[111, 222]|
+---+------------------+------------+-----+----------+

# columns you want to do array-explode
cols = df.columns

# number of array elements to set
N = 6

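Using the SQL higher-order function: transform

Use the Spark SQL higher-order function transform() to do the following:

Create the following Spark SQL code, where {0} will be replaced by the column_name and {1} by N: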

stmt = '''
   CASE
      WHEN '{0}' in ('d') THEN
        transform(sequence(0,{1}-1), x -> IF(x == 1, `{0}`[0], NULL))
      WHEN size(`{0}`) <= {1}/2 AND size(`{0}`) > 1 THEN
        transform(sequence(0,{1}-1), x -> IF(((x+1)*size(`{0}`))%{1} == 0, `{0}`[int((x-1)*size(`{0}`)/{1})], NULL))
      ELSE `{0}`
    END AS `{0}`
'''

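Note: the array transformation is only defined when the array contains more than one element (unless the column is handled in a separate WHEN clause) and at most N/2 elements (in this example, 1 < size <= 3). Arrays of any other size are kept as-is.

Run the above SQL with selectExpr() for all required columns: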
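To make the index arithmetic in stmt easier to follow, here is a minimal pure-Python sketch of the same CASE logic. The function name spread and its first_only parameter are hypothetical helpers introduced for illustration only; they are not part of Spark or of the answer's code.

```python
def spread(name, arr, n, first_only=("d",)):
    """Illustrative mirror of the CASE expression in stmt (not Spark code)."""
    size = len(arr)
    if name in first_only:
        # WHEN '{0}' in ('d'): put the first element at index 1, NULL elsewhere
        return [arr[0] if x == 1 else None for x in range(n)]
    if 1 < size <= n / 2:
        # second WHEN: keep an element wherever ((x+1)*size) % n == 0
        return [
            arr[int((x - 1) * size / n)] if ((x + 1) * size) % n == 0 else None
            for x in range(n)
        ]
    # ELSE: arrays of any other size pass through unchanged
    return arr


c_spread = spread("c", ["11", "22", "33"], 6)
d_spread = spread("d", ["foo"], 6)
e_spread = spread("e", [111, 222], 6)
```

With N = 6, c_spread places its three values at indexes 1, 3 and 5, and e_spread places its two values at indexes 2 and 5, which is the layout the transformed arrays are expected to have.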

df1 = df.withColumn('a', F.array('a')) \
        .selectExpr(*[ stmt.format(c,N) for c in cols ])

>>> df1.show()
+---+------------------+----------------+-----------+---------------+
|  a|                 b|               c|          d|              e|
+---+------------------+----------------+-----------+---------------+
|[1]|[1, 2, 3, 4, 5, 6]|[, 11,, 22,, 33]|[, foo,,,,]|[,, 111,,, 222]|
+---+------------------+----------------+-----------+---------------+

Run arrays_zip and explode:

df_new = df1.withColumn('vals', F.explode(F.arrays_zip(*cols))) \
            .select('vals.*') \
            .fillna('', subset=cols)

>>> df_new.show()
+----+---+---+---+----+
|   a|  b|  c|  d|   e|
+----+---+---+---+----+
|   1|  1|   |   |null|
|null|  2| 11|foo|null|
|null|  3|   |   | 111|
|null|  4| 22|   |null|
|null|  5|   |   |null|
|null|  6| 33|   | 222|
+----+---+---+---+----+

Note: fillna('', subset=cols) only changed the columns containing strings; the integer columns a and e keep their nulls.

In one method chain:
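The zip, explode and fill steps can be modelled in plain Python to see where each null comes from. This is only a sketch: itertools.zip_longest stands in for arrays_zip padding shorter arrays with null, and the helpers zip_and_explode and fill_strings are hypothetical names, not Spark APIs.

```python
from itertools import zip_longest

# The arrays as they look after the transform step
columns = {
    "a": [1],
    "b": [1, 2, 3, 4, 5, 6],
    "c": [None, "11", None, "22", None, "33"],
    "d": [None, "foo", None, None, None, None],
    "e": [None, None, 111, None, None, 222],
}


def zip_and_explode(cols):
    # zip_longest pads the shorter arrays with None, then each tuple becomes a row
    names = list(cols)
    return [dict(zip(names, values)) for values in zip_longest(*cols.values())]


def fill_strings(rows, string_cols, value=""):
    # replace None only in the string-typed columns, leaving the others untouched
    return [
        {k: (value if v is None and k in string_cols else v) for k, v in row.items()}
        for row in rows
    ]


rows = fill_strings(zip_and_explode(columns), string_cols={"c", "d"})
for row in rows:
    print(row)
```

Column a has a single element, so every row after the first gets a padded null, and the integer column e keeps its nulls because the string fill value does not apply to it.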

df_new = df.withColumn('a', F.array('a')) \
           .selectExpr(*[ stmt.format(c,N) for c in cols ]) \
           .withColumn('vals', F.explode(F.arrays_zip(*cols))) \
           .select('vals.*') \
           .fillna('', subset=cols)

Explanation of the transform function:

The transform function (listed below; it reflects an older revision of the requirements):

transform(sequence(0,5), x -> IF((x*size({0}))%6 == 0, {0}[int(x*size({0})/6)], NULL))

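As mentioned in the post, {0} will be replaced by the column name. Here we use column c, which contains 3 elements, as an example:

In the transform function, sequence(0,5) creates a constant array array(0,1,2,3,4,5) with 6 elements, and the rest defines a lambda function with one argument x that holds the value of the current element.
IF(condition, true_value, false_value) is a standard SQL function.

The condition we apply is (x*size(c))%6 == 0, where size(c)=3. If this condition is true, it returns c[int(x*size(c)/6)]; otherwise it returns NULL. So for x from 0 to 5 we have:

((0*3)%6 == 0) true   -->  c[int(0*3/6)] = c[0]
((1*3)%6 == 0) false  -->  NULL
((2*3)%6 == 0) true   -->  c[int(2*3/6)] = c[1]
((3*3)%6 == 0) false  -->  NULL
((4*3)%6 == 0) true   -->  c[int(4*3/6)] = c[2]
((5*3)%6 == 0) false  -->  NULL

The same applies to column e, which contains a 2-element array.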
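The same walk-through can be checked for column e (and any other size) with a small pure-Python sketch of this older formula. The function name spread_from_start is hypothetical and only mirrors the arithmetic; it is not the Spark expression itself.

```python
def spread_from_start(arr, n=6):
    """Mirror of transform(sequence(0,n-1), x -> IF((x*size)%n == 0, arr[int(x*size/n)], NULL))."""
    size = len(arr)
    return [
        arr[int(x * size / n)] if (x * size) % n == 0 else None
        for x in range(n)
    ]


c_old = spread_from_start([11, 22, 33])  # values land on x = 0, 2, 4
e_old = spread_from_start([111, 222])    # values land on x = 0, 3
d_old = spread_from_start(["foo"])       # the single value lands on x = 0
```

For e, size(e)=2, so the condition (x*2)%6 == 0 holds only for x=0 and x=3, which select e[0] and e[1].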

Answer 2

To get the desired output, you have to change column a into an array and insert null values into the c array.

from pyspark.sql import Row
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql.functions import explode, arrays_zip, col, array, udf

def fillArrayVals(a):
  # insert a null after each original element: [11, 22, 33] -> [11, None, 22, None, 33, None]
  for i in [1,3,5]:
    a.insert(i,None)
  return a

fillArrayValsUdf = udf(fillArrayVals, ArrayType(IntegerType(), True))

df = spark.createDataFrame([Row(a=1, b=[1, 2, 3, 4, 5, 6], c=[11, 22, 33], d=['foo'])])
# wrap the scalar column a in a one-element array and pad c with nulls
df = df.withColumn("a", array(col("a"))).withColumn("c", fillArrayValsUdf("c"))
# zip the arrays element-wise, then emit one row per zipped struct
df = df.withColumn("tmp", arrays_zip("a","b", "c", "d"))\
   .withColumn("tmp", explode("tmp"))\
   .select(col("tmp.a"), col("tmp.b"), col("tmp.c"), col("tmp.d"))

The code above produces the result below; you can cast the columns to string if you want to display empty values instead of null.

+----+---+----+----+
|   a|  b|   c|   d|
+----+---+----+----+
|   1|  1|  11| foo|
|null|  2|null|null|
|null|  3|  22|null|
|null|  4|null|null|
|null|  5|  33|null|
|null|  6|null|null|
+----+---+----+----+
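If the nulls need to go in other positions, as in the question's 2019-07-17 update, the body of the UDF could be written around explicit target positions instead of the hard-coded [1,3,5]. The sketch below is plain Python and only an assumption about how one might generalize it; place_at, positions and n are hypothetical names, and wrapping it in udf (for example with the arguments bound through functools.partial) is left out.

```python
def place_at(values, positions, n):
    """Return a new list of length n with values[i] stored at positions[i], None elsewhere."""
    if len(values) != len(positions):
        raise ValueError("values and positions must have the same length")
    out = [None] * n
    for pos, val in zip(positions, values):
        out[pos] = val
    return out


# Same layout as fillArrayVals above: values first, nulls in between
c_first = place_at([11, 22, 33], [0, 2, 4], 6)
# Layout asked for in the question's update: nulls first
c_shifted = place_at([11, 22, 33], [1, 3, 5], 6)
d_shifted = place_at(["foo"], [1], 6)
```

Unlike fillArrayVals, this version builds a new list rather than modifying its input in place.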

How to explode multiple columns of different types and different lengths?