Flip a dataframe

Time: 2018-05-28 11:52:05

Tags: python apache-spark pyspark

I am working on Databricks with Python 2.

I have a PySpark dataframe that consists of hundreds of columns and just one row. I want to flip it somehow, so that I get something like:

Name   | Views
--------------
Germany| 5
USA    | 3
UAE    | 3
Turkey | 42
Canada | 12

Here is what I have tried:

dicttest = {'Germany': 5, 'USA': 20, 'Turkey': 15}
rdd = sc.parallelize([dicttest]).toDF()
df = rdd.toPandas().transpose()

How should I approach this?

Edit: I have hundreds of columns, so I cannot write them out by hand. I do not know most of them; they are simply there. I cannot rely on the column names in this process.

Edit 2: Example code:

.replace(';', ' ')

2 Answers:

Answer 0 (score: 1)

This answer might be a bit of "overkill", but it does not use Pandas or collect anything to the driver. It will also work when you have multiple rows. We can pass an empty list as id_vars to the melt function from "How to melt Spark DataFrame?".

A working example looks like this:

import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct
from typing import Iterable

# Reuse an existing SparkContext (e.g. on Databricks) or create one locally
try:
    sc
except NameError:
    sc = ps.SparkContext()
    sqlContext = SQLContext(sc)

# From https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
def melt(
        df: DataFrame, 
        id_vars: Iterable[str], value_vars: Iterable[str], 
        var_name: str="variable", value_name: str="value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""

    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name)) 
        for c in value_vars))

    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    cols = id_vars + [
            col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

# Sample data
df1 = sqlContext.createDataFrame(
    [(0, 1, 2, 3, 4)],
    ("col1", "col2", "col3", "col4", "col5"))
df1.show()

df2 = melt(df1, id_vars=[], value_vars=df1.columns)
df2.show()

Output:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   0|   1|   2|   3|   4|
+----+----+----+----+----+

+--------+-----+
|variable|value|
+--------+-----+
|    col1|    0|
|    col2|    1|
|    col3|    2|
|    col4|    3|
|    col5|    4|
+--------+-----+
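
Since melt already takes var_name and value_name parameters, the same call can produce the exact Name/Views headers asked for in the question:

df3 = melt(df1, id_vars=[], value_vars=df1.columns,
           var_name="Name", value_name="Views")
df3.show()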

Hope this helps.

Answer 1 (score: -3)

You can convert the PySpark dataframe to a pandas dataframe and use its transpose function:

%pyspark

# One row, six columns
dt1 = [[1, 2, 4, 5, 6, 7]]
dt = sc.parallelize(dt1).toDF()
dt.show()

+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6|
+---+---+---+---+---+---+
|  1|  2|  4|  5|  6|  7|
+---+---+---+---+---+---+

dt.toPandas().transpose()

Output:

    0
_1  1
_2  2
_3  4
_4  5
_5  6
_6  7
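
If the result needs to be a Spark dataframe again, the transposed pandas frame can be converted back. A minimal sketch, assuming a SparkSession named spark is available (it is predefined on Databricks):

pdf = dt.toPandas().transpose().reset_index()  # the index (_1.._6) becomes a column
pdf.columns = ["Name", "Views"]                # rename the two pandas columns
result = spark.createDataFrame(pdf)            # back to a Spark dataframe
result.show()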

Another solution:

dt2 = [{"1": 1, "2": 2, "4": 4, "5": 5, "6": 29, "7": 8}]
df = sc.parallelize(dt2).toDF()
df.show()

# Pair each column name with its single value (one collect per column)
a = [{"name": i, "value": df.select(i).collect()[0][0]} for i in df.columns]
df1 = sc.parallelize(a).toDF()
df1.show()
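
Note that the list comprehension above issues one collect() per column. Since there is only one row anyway, a variant that fetches that row once and builds the name/value pairs locally avoids the repeated round trips; a minimal sketch on the same df:

row = df.first()  # fetch the single row once
pairs = [{"name": c, "value": row[c]} for c in df.columns]
flipped = sc.parallelize(pairs).toDF()
flipped.show()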