Creating multiple rows from one row in Spark Scala

Time: 2018-05-17 14:34:27

Tags: scala apache-spark

Given a DataFrame or Parquet file in Spark with the input data below, multiple rows should be generated from each single row using Spark Scala. Input:

Id    PersonName  Dept  year  Language
1     David       501   2018  English
2     Nancy       501   2018  English 
3     Shyam       502   2018  Hindi

The output in the file or DataFrame should look like this:

1  David 
1  501   2018 
1  David English
2  Nancy 
2  501   2018 
2  Nancy English 
3  Shyam
3  502  2018
3  Shyam Hindi

1 Answer:

Answer 0 (score: 1)

@Arvy I'm not sure why you would want to do this; a table should normally have a consistent set of columns. That said, it can be done with simple selects and unions: project each input row into three narrow rows that share one common three-column schema, then union the projections.

PySpark

Create the DataFrames:

    from pyspark.sql.functions import lit

    # Rename each projection to a common schema so the unions line up
    col_names = ["col1", "col2", "col3"]
    df1 = df.select('ID', 'Dept', 'year').toDF(*col_names)
    df2 = df.select('ID', 'PersonName', 'Language').toDF(*col_names)
    # Pad the two-column projection with an empty literal column
    df3 = df.select('ID', 'PersonName').withColumn('a', lit('')).toDF(*col_names)

    df_random = df1.union(df2).union(df3).orderBy('col1')
    df_random.show()

Scala

Create a new DataFrame:

    import org.apache.spark.sql.functions.lit

    // Rename each projection to a common schema so the unions line up
    val col_names = Seq("col1", "col2", "col3")
    val df1 = df.select("ID", "Dept", "year").toDF(col_names: _*)
    val df2 = df.select("ID", "PersonName", "Language").toDF(col_names: _*)
    // Pad the two-column projection with an empty literal column
    val df3 = df.select("ID", "PersonName").withColumn("a", lit("")).toDF(col_names: _*)

    val df_random = df1.union(df2).union(df3).orderBy("col1")
    df_random.show()
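To make the row expansion explicit, the same projection-and-union idea can be sketched in plain Python without Spark. This is only an illustration of the shape of the transformation, not Spark code; the sample records are the input data from the question, and `explode_row` is a hypothetical helper name.

```python
# Plain-Python sketch of the select/union approach: each input record
# is expanded into three rows that share one three-column shape.

rows = [
    {"ID": 1, "PersonName": "David", "Dept": 501, "year": 2018, "Language": "English"},
    {"ID": 2, "PersonName": "Nancy", "Dept": 501, "year": 2018, "Language": "English"},
    {"ID": 3, "PersonName": "Shyam", "Dept": 502, "year": 2018, "Language": "Hindi"},
]

def explode_row(r):
    # One input row becomes three output rows, mirroring df3, df1 and df2 above.
    return [
        (r["ID"], r["PersonName"], ""),            # ID + name, padded with ""
        (r["ID"], r["Dept"], r["year"]),           # ID + dept + year
        (r["ID"], r["PersonName"], r["Language"]), # ID + name + language
    ]

# Three output rows per input record, nine in total
result = [out for r in rows for out in explode_row(r)]
for t in result:
    print(t)
```

The key point in both the PySpark and the Scala versions is the same: every projection is renamed to the shared `col1`/`col2`/`col3` schema before the union, because `union` matches columns by position, not by name.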
