Converting PySpark DataFrame columns to rows under specific conditions

Asked: 2018-12-25 14:14:06

Tags: apache-spark dataframe pyspark bigdata

Below is a sample of my input data. Starting from column C there can be many columns, each holding a different score.

The output should follow this logic: for a particular value of A, columns A, B, and E stay fixed on each row, and each of the input columns (C, E ... X) becomes its own row. If a null value is encountered it must be discarded and the search continues on the next row. Once a value of C or D has been found for a given A, I move on to the next value of A. In short, for each value of A we need the minimum of C and D.

(Input dataframe and expected output were provided as images.)

1 Answer:

Answer 0 (score: 0)

You said "in short, for each A value, we need the minimum of C and D", so following that logic I computed the minimum of C and D for each value of A. The third row of your expected output does not match mine, because the minimum of D for A = 130 is 100.09. If the logic is slightly different, you can adapt the code accordingly.

from pyspark.sql.types import StructType, StructField, StringType, FloatType
from pyspark.sql import Row
from pyspark.sql.functions import array, col, explode, struct, lit

schema = StructType([StructField('A', StringType()), StructField('B',FloatType()), 
                    StructField('C',FloatType()),StructField('D',FloatType()),
                    StructField('E',FloatType())])
rows = [Row(A='123',B=None,C=100.22,D=None,E=3501.88), Row(A='123',B=None,C=102.212,D=101.2187,E=3502.88),
        Row(A='123',B=None,C=103.22,D=103.22,E=3503.22), Row(A='130', B=None, C=None, D=101.22, E=355.0),
        Row(A='130',B=None,C=None,D=102.28,E=356.8), Row(A='130',B=None,C=100.09,D=100.09,E=357.8)]
df = spark.createDataFrame(rows, schema)
df.show()
+---+----+-------+--------+-------+
|  A|   B|      C|       D|      E|
+---+----+-------+--------+-------+
|123|null| 100.22|    null|3501.88|
|123|null|102.212|101.2187|3502.88|
|123|null| 103.22|  103.22|3503.22|
|130|null|   null|  101.22|  355.0|
|130|null|   null|  102.28|  356.8|
|130|null| 100.09|  100.09|  357.8|
+---+----+-------+--------+-------+

#This function is used to explode the DataFrame
def to_long(df, by):

    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
      struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

df = to_long(df[['A','C','D','E']], ['A','E'])
#df.show()
df = df.select(col('A'), col('key').alias('XX'), col('val').alias('Score'), col('E').alias('ZZ'))
#df.show()
df = df.where(col("Score").isNotNull())
#df.show()
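
As a side note, the same wide-to-long reshape and null filter can also be written with Spark SQL's stack function instead of the explode-based helper. A minimal sketch, assuming df_wide refers to the original wide DataFrame built with spark.createDataFrame(rows, schema) above (the name df_wide is only for illustration, since df has been reassigned at this point):

# stack(2, 'C', C, 'D', D) emits one (XX, Score) row per column, keeping A and E (renamed ZZ)
df_long = df_wide.selectExpr(
    "A",
    "stack(2, 'C', C, 'D', D) as (XX, Score)",
    "E as ZZ"
).where(col("Score").isNotNull())
df_long.show()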

df.registerTempTable('table_view')
# For each (A, XX) group, keep the row(s) whose Score equals the group minimum
df1 = sqlContext.sql(
    'SELECT A, XX, Score, ZZ FROM (SELECT *, min(Score) OVER (PARTITION BY A, XX) AS minScore FROM table_view) M WHERE Score = minScore'
)
df1.show()
+---+---+--------+-------+
|  A| XX|   Score|     ZZ|
+---+---+--------+-------+
|123|  C|  100.22|3501.88|
|123|  D|101.2187|3502.88|
|130|  C|  100.09|  357.8|
|130|  D|  100.09|  357.8|
+---+---+--------+-------+
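
For reference, the final "minimum Score per (A, XX)" step can also be expressed with the DataFrame API instead of a SQL string. A minimal sketch, assuming df is the long, null-filtered DataFrame with columns A, XX, Score and ZZ produced above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Keep, within each (A, XX) group, only the rows whose Score equals the group minimum
w = Window.partitionBy('A', 'XX')
df1 = (df.withColumn('minScore', F.min('Score').over(w))
         .where(F.col('Score') == F.col('minScore'))
         .drop('minScore'))
df1.show()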