PySpark DataFrame: melting columns into rows

Date: 2019-03-27 13:10:28

Tags: python dataframe pyspark aggregate melt



Here is an excerpt of my DataFrame:

+--------+----------------+-----------------+----------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
|     100|              30|              105|              35|
|     200|              55|               85|              65|
|     300|              20|              125|              90|
+--------+----------------+-----------------+----------------+

This is the desired result, with multiple rows per store: the columns of the original DataFrame are melted into rows of the new DataFrame, with one row per original column and the column's name recorded in a new CATEGORY column:

+--------+--------+-----------+
|store_id|CATEGORY|qty_on_hand|
+--------+--------+-----------+
|     100|    milk|         30|
|     100|   bread|        105|
|     100|    eggs|         35|
|     200|    milk|         55|
|     200|   bread|         85|
|     200|    eggs|         65|
|     300|    milk|         20|
|     300|   bread|        125|
|     300|    eggs|         90|
+--------+--------+-----------+

Finally, I want to aggregate the resulting DataFrame to get the total for each category:

+--------+-----------------+
|CATEGORY|total_qty_on_hand|
+--------+-----------------+
|    milk|              105|
|   bread|              315|
|    eggs|              190|
+--------+-----------------+

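Update: It was suggested that this question is a duplicate and is answered in another question (linked in the original post). That is not the case: the solution there converts rows into columns, whereas I need the opposite, melting columns into rows.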

3 answers:

Answer 0 (score: 1)





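  • Annotate each column with your custom label (e.g. "milk")
  • Combine the labelled columns into a single column of type "array"
  • Explode the labels column to generate one labelled row per element
  • Drop the columns you no longer need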
import pyspark.sql.functions as F

df = (
    df.withColumn('labels', F.explode(                         # <-- Split into rows
        F.array(                                               # <-- Combine columns
            F.array(F.lit('milk'), F.col('qty_on_hand_milk')), # <-- Annotate column
            F.array(F.lit('bread'), F.col('qty_on_hand_bread')),
            F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')),
        )
    )).withColumn('CATEGORY', F.col('labels')[0]).withColumn('qty_on_hand', F.col('labels')[1])
).select('store_id', 'CATEGORY', 'qty_on_hand')

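Note that elements of an array column can be extracted directly with col('foo')[INDEX]; there is no need to split the array into separate columns first.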


For example, if "qty_on_hand_bread" were a string column, this would still work, and the resulting schema would be:

 |-- store_id: long (nullable = false)
 |-- CATEGORY: string (nullable = true)
 |-- qty_on_hand: string (nullable = true) <-- Picks best schema on the fly


import databricks.koalas as ks
import pyspark.sql.functions as F

# You don't need koalas, it's just less verbose for adhoc dataframes
df = ks.DataFrame({
    "store_id": [100, 200, 300],
    "qty_on_hand_milk": [30, 55, 20],
    "qty_on_hand_bread": [105, 85, 125],
    "qty_on_hand_eggs": [35, 65, 90],
}).to_spark()  # convert to a Spark DataFrame so withColumn is available

# Annotate each column with your custom label per row. ie. v -> ['label', v]
df = df.withColumn('label1', F.array(F.lit('milk'), F.col('qty_on_hand_milk')))
df = df.withColumn('label2', F.array(F.lit('bread'), F.col('qty_on_hand_bread')))
df = df.withColumn('label3', F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')))

# Create a new column which combines the labeled values in a single column
df = df.withColumn('labels', F.array('label1', 'label2', 'label3'))

# Split into individual rows
df = df.withColumn('labels', F.explode('labels'))

# You can now do whatever you want with your labelled rows, eg. split them into new columns
df = df.withColumn('CATEGORY', F.col('labels')[0])
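df = df.withColumn('qty_on_hand', F.col('labels')[1])

# Drop the helper columns, keeping only the melted result
df = df.select('store_id', 'CATEGORY', 'qty_on_hand')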


|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
|     100|              30|              105|              35|
|     200|              55|               85|              65|
|     300|              20|              125|              90|

|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|    label1|      label2|    label3|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|

|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|    label1|      label2|    label3|              labels|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|[[milk, 30], [bre...|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]|[[milk, 55], [bre...|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|[[milk, 20], [bre...|

|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|    label1|      label2|    label3|      labels|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|  [milk, 30]|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|  [eggs, 35]|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]|  [milk, 55]|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]|  [eggs, 65]|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|  [milk, 20]|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|  [eggs, 90]|

|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|    label1|      label2|    label3|      labels|CATEGORY|qty_on_hand|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|  [milk, 30]|    milk|         30|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]|   bread|        105|
|     100|              30|              105|              35|[milk, 30]|[bread, 105]|[eggs, 35]|  [eggs, 35]|    eggs|         35|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]|  [milk, 55]|    milk|         55|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]|   bread|         85|
|     200|              55|               85|              65|[milk, 55]| [bread, 85]|[eggs, 65]|  [eggs, 65]|    eggs|         65|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|  [milk, 20]|    milk|         20|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]|   bread|        125|
|     300|              20|              125|              90|[milk, 20]|[bread, 125]|[eggs, 90]|  [eggs, 90]|    eggs|         90|

|store_id|CATEGORY|qty_on_hand|
|     100|    milk|         30|
|     100|   bread|        105|
|     100|    eggs|         35|
|     200|    milk|         55|
|     200|   bread|         85|
|     200|    eggs|         65|
|     300|    milk|         20|
|     300|   bread|        125|
|     300|    eggs|         90|
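
As a cross-check on the expected output, the annotate / combine / explode / select steps of this answer can be modelled in plain Python, with no Spark session required. This is only a sketch of the row-level logic, not PySpark code: `melt_rows` and `LABELS` are hypothetical names introduced for illustration, and quantities stay integers here, whereas the schema shown earlier in this answer reports qty_on_hand as a string.

```python
# Plain-Python model of the array + explode steps (no Spark required).
rows = [
    {"store_id": 100, "qty_on_hand_milk": 30, "qty_on_hand_bread": 105, "qty_on_hand_eggs": 35},
    {"store_id": 200, "qty_on_hand_milk": 55, "qty_on_hand_bread": 85, "qty_on_hand_eggs": 65},
    {"store_id": 300, "qty_on_hand_milk": 20, "qty_on_hand_bread": 125, "qty_on_hand_eggs": 90},
]

# Hypothetical mapping: custom label -> source column.
LABELS = {
    "milk": "qty_on_hand_milk",
    "bread": "qty_on_hand_bread",
    "eggs": "qty_on_hand_eggs",
}


def melt_rows(rows, labels):
    """Return one (store_id, CATEGORY, qty_on_hand) tuple per labelled column."""
    melted = []
    for row in rows:
        # Annotate + combine: the per-row "array" of (label, value) pairs.
        pairs = [(label, row[column]) for label, column in labels.items()]
        # Explode: emit one output row per pair, keeping only the needed fields.
        for label, qty in pairs:
            melted.append((row["store_id"], label, qty))
    return melted


melted = melt_rows(rows, LABELS)
for record in melted:
    print(record)
```

Three input rows and three labelled columns give nine output rows, which is the shape of the final table above.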

Answer 1 (score: 0)

One possible approach, using col, when, and other helpers from the pyspark functions module:

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import col, posexplode, split
>>> from pyspark.sql.types import StringType
# join the values with commas so they can be split back into an array later
>>> concat_udf = F.udf(lambda cols: ",".join([str(x) if x is not None else "*" for x in cols]), StringType())

>>> rdd = sc.parallelize([[100,30,105,35],[200,55,85,65],[300,20,125,90]])
>>> df = rdd.toDF(['store_id','qty_on_hand_milk','qty_on_hand_bread','qty_on_hand_eggs'])

|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
|     100|              30|              105|              35|
|     200|              55|               85|              65|
|     300|              20|              125|              90|

# add one more column holding the joined values of all three columns
>>> df_1 = df.withColumn("new_col", concat_udf(F.array("qty_on_hand_milk", "qty_on_hand_bread", "qty_on_hand_eggs")))
# convert it into array<int> for carrying out agg operations
>>> df_2 = df_1.withColumn("new_col_1", split(col("new_col"), r",\s*").cast("array<int>"))
# posexplode gives you the position along with the usual explode, which helps in categorizing
>>> df_3 = df_2.select("store_id", posexplode("new_col_1").alias("col_1", "qty"))
# if/else conditioning for the category column
>>> df_3.withColumn("category",F.when(col("col_1") == 0, "milk").when(col("col_1") == 1, "bread").otherwise("eggs")).select("store_id","category","qty").show()
|store_id|category|qty|
|     100|    milk| 30|
|     100|   bread|105|
|     100|    eggs| 35|
|     200|    milk| 55|
|     200|   bread| 85|
|     200|    eggs| 65|
|     300|    milk| 20|
|     300|   bread|125|
|     300|    eggs| 90|

#aggregating to find sum
>>> df_3.withColumn("category",F.when(col("col_1") == 0, "milk").when(col("col_1") == 1, "bread").otherwise("eggs")).select("category","qty").groupBy('category').sum().show()
|category|sum(qty)|
|    eggs|     190|
|   bread|     315|
|    milk|     105|
>>> df_3.printSchema()
 |-- store_id: long (nullable = true)
 |-- col_1: integer (nullable = false)
 |-- qty: integer (nullable = true)
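
The role of posexplode in this answer, pairing each array element with its position so that the position can be mapped to a category, can be illustrated in plain Python with `enumerate`. This is a hedged sketch rather than Spark code: `POSITION_TO_CATEGORY` and `posexplode_row` are hypothetical names, and the fallback to "eggs" mirrors the otherwise("eggs") branch above.

```python
# Plain-Python model of posexplode followed by the when/otherwise mapping.
POSITION_TO_CATEGORY = {0: "milk", 1: "bread"}  # any other position -> "eggs"


def posexplode_row(store_id, quantities):
    """Yield (store_id, category, qty) for each element and its position."""
    for position, qty in enumerate(quantities):
        category = POSITION_TO_CATEGORY.get(position, "eggs")
        yield (store_id, category, qty)


data = [(100, [30, 105, 35]), (200, [55, 85, 65]), (300, [20, 125, 90])]
exploded = [rec for store_id, qtys in data for rec in posexplode_row(store_id, qtys)]

# Aggregate per category, in the spirit of groupBy('category').sum().
totals = {}
for _, category, qty in exploded:
    totals[category] = totals.get(category, 0) + qty

print(totals)
```

Because the category is derived purely from the array position, this approach depends on the order in which the columns are passed to F.array.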

Answer 2 (score: 0)


# Loading the requisite packages
from pyspark.sql.functions import col, explode, array, struct, expr, sum, lit
# Creating the DataFrame
df = sqlContext.createDataFrame([(100,30,105,35),(200,55,85,65),(300,20,125,90)],('store_id','qty_on_hand_milk','qty_on_hand_bread','qty_on_hand_eggs'))
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
|     100|              30|              105|              35|
|     200|              55|               85|              65|
|     300|              20|              125|              90|


def to_explode(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
      struct(lit(c).alias("CATEGORY"), col(c).alias("qty_on_hand")) for c in cols
    ])).alias("kvs")

    # Keep the `by` columns and flatten the struct fields into top-level columns
    return df.select(by + [kvs]).select(by + ["kvs.CATEGORY", "kvs.qty_on_hand"])


df = to_explode(df, ['store_id']).drop('store_id')
|         CATEGORY|qty_on_hand|
| qty_on_hand_milk|         30|
|qty_on_hand_bread|        105|
| qty_on_hand_eggs|         35|
| qty_on_hand_milk|         55|
|qty_on_hand_bread|         85|
| qty_on_hand_eggs|         65|
| qty_on_hand_milk|         20|
|qty_on_hand_bread|        125|
| qty_on_hand_eggs|         90|


df = df.withColumn('CATEGORY',expr('substring(CATEGORY, 13)'))
|CATEGORY|qty_on_hand|
|    milk|         30|
|   bread|        105|
|    eggs|         35|
|    milk|         55|
|   bread|         85|
|    eggs|         65|
|    milk|         20|
|   bread|        125|
|    eggs|         90|


df = df.groupBy(['CATEGORY']).agg(sum('qty_on_hand').alias('total_qty_on_hand'))
|CATEGORY|total_qty_on_hand|
|    eggs|              190|
|   bread|              315|
|    milk|              105|
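
Two details of this answer are easy to get wrong: the same-type assertion in to_explode, and the offset in substring(CATEGORY, 13). Spark SQL's substring is 1-based, so position 13 is the first character after the 12-character prefix qty_on_hand_. The following plain-Python sketch models both without Spark; `melt` and `PREFIX` are hypothetical names introduced here for illustration.

```python
PREFIX = "qty_on_hand_"  # 12 characters, so the category starts at 1-based position 13


def melt(rows, by):
    """Turn every column not listed in `by` into a (CATEGORY, qty_on_hand) row."""
    value_cols = [c for c in rows[0] if c not in by]
    # Mirror the assertion in to_explode: all melted columns share one type.
    types = {type(row[c]) for row in rows for c in value_cols}
    assert len(types) == 1, "All columns have to be of the same type"
    return [
        {**{k: row[k] for k in by}, "CATEGORY": c, "qty_on_hand": row[c]}
        for row in rows
        for c in value_cols
    ]


rows = [
    {"store_id": 100, "qty_on_hand_milk": 30, "qty_on_hand_bread": 105, "qty_on_hand_eggs": 35},
    {"store_id": 200, "qty_on_hand_milk": 55, "qty_on_hand_bread": 85, "qty_on_hand_eggs": 65},
    {"store_id": 300, "qty_on_hand_milk": 20, "qty_on_hand_bread": 125, "qty_on_hand_eggs": 90},
]

melted = melt(rows, by=["store_id"])
# Python slices are 0-based, so [len(PREFIX):] corresponds to Spark's substring(..., 13).
categories = [r["CATEGORY"][len(PREFIX):] for r in melted]
print(len(PREFIX) + 1, categories[:3])
```

If the column prefix ever changes length, the hard-coded 13 in the Spark expression has to change with it; deriving the offset from the prefix, as in the sketch, avoids that.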