PySpark transformation: column names to rows

Date: 2019-12-11 08:44:21

Tags: pyspark pyspark-sql pyspark-dataframes

I am working with PySpark and would like to transform this Spark DataFrame:

    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | TS | ABC[0].VAL.VAL[0].UNT[0].sth1 | ABC[0].VAL.VAL[0].UNT[1].sth1 | ABC[0].VAL.VAL[1].UNT[0].sth1 | ABC[0].VAL.VAL[1].UNT[1].sth1 | ABC[0].VAL.VAL[0].UNT[0].sth2 | ABC[0].VAL.VAL[0].UNT[1].sth2 | ABC[0].VAL.VAL[1].UNT[0].sth2 | ABC[0].VAL.VAL[1].UNT[1].sth2 |
    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | 1  | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    | some_value                    |
    +----+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+

into this:

+----+-----+-----+------------+------------+
| TS | VAL | UNT |    sth1    |    sth2    |
+----+-----+-----+------------+------------+
|  1 |   0 |   0 | some_value | some_value |
|  1 |   0 |   1 | some_value | some_value |
|  1 |   1 |   0 | some_value | some_value |
|  1 |   1 |   1 | some_value | some_value |
+----+-----+-----+------------+------------+

Any idea how I could do this with some fancy transformation?

Edit: This is how I was able to solve it:

from pyspark.sql.functions import array, col, explode, struct, lit
import re

df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1),
                     (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
    ["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1",
     "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1",
     "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
     "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])

# Column names containing dots would need backtick quoting, so replace dots with underscores.
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)

# Melt every column except TS into an array of structs, extracting the VAL and UNT
# indices and the parameter name (sth1/sth2) from each column name.
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in ["TS"]))
kvs = explode(array([struct(
                            lit(re.search(r"VAL\[(\d{1,2})\]", c).group(1)).alias("VAL"),
                            lit(re.search(r"UNT\[(\d{1,2})\]", c).group(1)).alias("UNT"),
                            lit(re.search(r"([^_]+$)", c).group(1)).alias("Parameter"),
                            col(c).alias("data")) for c in cols
                    ])).alias("kvs")

# Explode to long format, then pivot the parameter names back into columns.
# display() is Databricks-specific; .show() works elsewhere.
display(df.select(["TS"] + [kvs])
          .select(["TS"] + ["kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"])
          .groupBy("TS", "VAL", "UNT").pivot("Parameter").sum("data")
          .orderBy("TS", "VAL", "UNT"))

Output:

+----+-----+-----+------+------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------+------+
|  1 |   0 |   0 |    0 |  0.7 |
|  1 |   0 |   1 |  0.6 |  0.2 |
|  1 |   1 |   0 |  0.1 |  0.4 |
|  1 |   1 |   1 |  0.4 |  0.1 |
|  2 |   0 |   0 |  0.6 |  0.8 |
|  2 |   0 |   1 |  0.7 |  0.3 |
|  2 |   1 |   0 |  0.1 |  0.1 |
|  2 |   1 |   1 |  0.5 |  0.3 |
+----+-----+-----+------+------+
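
For reference, the intermediate long format produced by kvs (one row per VAL/UNT/Parameter combination, before the groupBy/pivot) can be inspected with a plain show(); the expected shape is sketched in the comment:

df.select(["TS"] + [kvs]) \
  .select(["TS"] + ["kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"]) \
  .show()
# Yields columns TS, VAL, UNT, Parameter, data; for example the first input row
# produces (1, 0, 0, sth1, 0.0), (1, 0, 1, sth1, 0.6), ... before pivoting.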

How could this be done better?

2 answers:

Answer 0 (score: 0)

This is how I was able to solve it:

from pyspark.sql.functions import array, col, explode, struct, lit
import re

df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1),
                     (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
    ["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1",
     "ABC[0].VAL.VAL[1].UNT[0].sth1", "ABC[0].VAL.VAL[1].UNT[1].sth1",
     "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
     "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])

# Column names containing dots would need backtick quoting, so replace dots with underscores.
newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)

# Melt every column except TS into an array of structs, extracting the VAL and UNT
# indices and the parameter name (sth1/sth2) from each column name.
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in ["TS"]))
kvs = explode(array([struct(
                            lit(re.search(r"VAL\[(\d{1,2})\]", c).group(1)).alias("VAL"),
                            lit(re.search(r"UNT\[(\d{1,2})\]", c).group(1)).alias("UNT"),
                            lit(re.search(r"([^_]+$)", c).group(1)).alias("Parameter"),
                            col(c).alias("data")) for c in cols
                    ])).alias("kvs")

# Explode to long format, then pivot the parameter names back into columns.
# display() is Databricks-specific; .show() works elsewhere.
display(df.select(["TS"] + [kvs])
          .select(["TS"] + ["kvs.VAL", "kvs.UNT", "kvs.Parameter", "kvs.data"])
          .groupBy("TS", "VAL", "UNT").pivot("Parameter").sum("data")
          .orderBy("TS", "VAL", "UNT"))

Output:

+----+-----+-----+------+------+
| TS | VAL | UNT | sth1 | sth2 |
+----+-----+-----+------+------+
|  1 |   0 |   0 |    0 |  0.7 |
|  1 |   0 |   1 |  0.6 |  0.2 |
|  1 |   1 |   0 |  0.1 |  0.4 |
|  1 |   1 |   1 |  0.4 |  0.1 |
|  2 |   0 |   0 |  0.6 |  0.8 |
|  2 |   0 |   1 |  0.7 |  0.3 |
|  2 |   1 |   0 |  0.1 |  0.1 |
|  2 |   1 |   1 |  0.5 |  0.3 |
+----+-----+-----+------+------+

Now at least tell me how to do this better...

Answer 1 (score: -1)

Your approach is a good one (upvoted). The one thing I would do differently is extract the necessary parts from the column names in a single regular-expression search. I would also drop the extra select in favour of groupBy, but that is a minor point.

import re

from pyspark.sql.functions import lit, explode, array, struct, col

df = sc.parallelize([(1, 0.0, 0.6, 0.1, 0.4, 0.7, 0.2, 0.4, 0.1), (2, 0.6, 0.7, 0.1, 0.5, 0.8, 0.3, 0.1, 0.3)]).toDF(
    ["TS", "ABC[0].VAL.VAL[0].UNT[0].sth1", "ABC[0].VAL.VAL[0].UNT[1].sth1", "ABC[0].VAL.VAL[1].UNT[0].sth1",
     "ABC[0].VAL.VAL[1].UNT[1].sth1", "ABC[0].VAL.VAL[0].UNT[0].sth2", "ABC[0].VAL.VAL[0].UNT[1].sth2",
     "ABC[0].VAL.VAL[1].UNT[0].sth2", "ABC[0].VAL.VAL[1].UNT[1].sth2"])

newcols = list(map(lambda x: x.replace(".", "_"), df.columns))
df = df.toDF(*newcols)


def extract_indices_and_label(column_name):
    s = re.match(r"\D+\d+\D+(\d+)\D+(\d+)[^_]_(.*)$", column_name)
    m, n, label = s.groups()
    return int(m), int(n), label


def create_struct(column_name):
    val, unt, label = extract_indices_and_label(column_name)
    return struct(lit(val).alias("val"),
                  lit(unt).alias("unt"),
                  lit(label).alias("label"),
                  col(column_name).alias("value"))


df2 = (df.select(
    df.TS,
    explode(array([create_struct(c) for c in df.columns[1:]]))))

df2.printSchema()  # this is instructional: it shows the structure is nearly there
# root
#  |-- TS: long (nullable = true)
#  |-- col: struct (nullable = false)
#  |    |-- val: integer (nullable = false)
#  |    |-- unt: integer (nullable = false)
#  |    |-- label: string (nullable = false)
#  |    |-- value: double (nullable = true)

df3 = (df2
       .groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT"))
       .pivot("col.label", values=("sth1", "sth2"))
       .sum("col.value"))

df3.orderBy("TS", "VAL", "UNT").show()
# +---+---+---+----+----+                                                         
# | TS|VAL|UNT|sth1|sth2|
# +---+---+---+----+----+
# |  1|  0|  0| 0.0| 0.7|
# |  1|  0|  1| 0.6| 0.2|
# |  1|  1|  0| 0.1| 0.4|
# |  1|  1|  1| 0.4| 0.1|
# |  2|  0|  0| 0.6| 0.8|
# |  2|  0|  1| 0.7| 0.3|
# |  2|  1|  0| 0.1| 0.1|
# |  2|  1|  1| 0.5| 0.3|
# +---+---+---+----+----+

If you know a priori that only the two columns sth1 and sth2 will be pivoted, you can pass them to pivot's values argument, which improves efficiency further.
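
For comparison, a minimal sketch of the variant without an explicit values list, reusing df2 and col.label from the code above (the name df3_auto is just illustrative); here Spark has to run an extra job to collect the distinct labels before it can pivot:

df3_auto = (df2
            .groupBy(df2.TS, df2.col.val.alias("VAL"), df2.col.unt.alias("UNT"))
            .pivot("col.label")  # no values given, so distinct labels are computed first
            .sum("col.value"))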