Concatenating Spark DataFrame columns with their rows in Scala

Time: 2018-10-08 18:13:05

Tags: scala apache-spark apache-spark-sql

I am trying to build a string by concatenating values from a DataFrame. For example:

val df = Seq(
  ("20181001","10"),     
  ("20181002","40"),
  ("20181003","50")).toDF("Date","Key")
df.show

The output of the DF is as follows:

+--------+---+
|    Date|Key|
+--------+---+
|20181001| 10|
|20181002| 40|
|20181003| 50|
+--------+---+

Here I want to build a condition from the values of the DataFrame, for example: (Date = 20181001 and Key = 10) or (Date = 20181002 and Key = 40) or (Date = 20181003 and Key = 50), and so on. The generated condition will be used as input to another process. The columns in the DataFrame can also be dynamic.

I have tried the snippet below, and it forms the string as required, but it is static. I am also not sure how it will perform when I have to generate the condition over more than 10 columns. Any suggestions are highly appreciated.

val df = Seq(
  ("20181001","10"),     
  ("20181002","40"),
  ("20181003","50")).toDF("Date","Key")

val colList = df.columns
var cond1 = ""
var finalCond =""
for (row <- df.rdd.collect)
 {
    cond1 = "("
    var pk = row.mkString(",").split(",")(0)
    cond1 = cond1+colList(0)+"="+pk
    var ak = row.mkString(",").split(",")(1)
    cond1 = cond1 +" and " + colList(1)+ "=" +ak +")"
    finalCond = finalCond + cond1 + " or " 
    cond1= ""    
 }
 print("Condition:" +finalCond.dropRight(3))

Output:

Condition:(Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50)

3 Answers:

Answer 0 (score: 2)

Check this DataFrame solution.

scala> val df = Seq(
       |   ("20181001","10"),
       |   ("20181002","40"),
       |   ("20181003","50")).toDF("Date","Key")
df: org.apache.spark.sql.DataFrame = [Date: string, Key: string]

scala> val df2 = df.withColumn("gencond",concat(lit("(Date="), 'Date, lit(" and Key=") ,'Key,lit(")")))
df2: org.apache.spark.sql.DataFrame = [Date: string, Key: string ... 1 more field]


scala> df2.agg(collect_list('gencond)).show(false)
+------------------------------------------------------------------------------------+
|collect_list(gencond)                                                               |
+------------------------------------------------------------------------------------+
|[(Date=20181001 and Key=10), (Date=20181002 and Key=40), (Date=20181003 and Key=50)]|
+------------------------------------------------------------------------------------+
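To turn the collected list into the single condition string the question asks for, the same aggregation can be finished off with concat_ws. A minimal sketch, reusing the df2 built above:

import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// Collect the per-row conditions and join them with " or " into one string on the driver.
val condition = df2
  .agg(collect_list(col("gencond")).as("list"))
  .select(concat_ws(" or ", col("list")))
  .first()
  .getString(0)
// condition: (Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50)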

EDIT1

You can read them from the parquet files and simply rename the columns as in this solution. In the last step, substitute the original parquet column names back in. Check this out:

scala> val df = Seq(("101","Jack"),("103","wright")).toDF("id","name")  // Original names from parquet
df: org.apache.spark.sql.DataFrame = [id: string, name: string]

scala> val df2= df.select("*").toDF("Date","Key")  // replace it with Date/Key as we used in this question
df2: org.apache.spark.sql.DataFrame = [Date: string, Key: string]

scala> val df3 = df2.withColumn("gencond",concat(lit("(Date="), 'Date, lit(" and Key=") ,'Key,lit(")")))
df3: org.apache.spark.sql.DataFrame = [Date: string, Key: string ... 1 more field]

scala> val df4=df3.agg(collect_list('gencond).as("list"))
df4: org.apache.spark.sql.DataFrame = [list: array<string>]

scala> df4.select(concat_ws(" or ",'list)).show(false)
+----------------------------------------------------+
|concat_ws( or , list)                               |
+----------------------------------------------------+
|(Date=101 and Key=Jack) or (Date=103 and Key=wright)|
+----------------------------------------------------+

scala> val a = df.columns(0)
a: String = id

scala> val b = df.columns(1)
b: String = name

scala>  df4.select(concat_ws(" or ",'list).as("new1")).select(regexp_replace('new1,"Date",a).as("colx")).select(regexp_replace('colx,"Key",b).as("colxy")).show(false)
+--------------------------------------------------+
|colxy                                             |
+--------------------------------------------------+
|(id=101 and name=Jack) or (id=103 and name=wright)|
+--------------------------------------------------+
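Since the question notes that the columns can be dynamic, the same idea can also be written without hard-coding the column names or the regexp_replace step. A sketch, assuming all columns can be cast to string:

import org.apache.spark.sql.functions.{col, collect_list, concat, concat_ws, lit}

// Build "(col1=v1 and col2=v2 and ...)" per row from whatever columns df has.
val perRow = df.columns
  .map(c => concat(lit(c + "="), col(c).cast("string")))
  .reduce((a, b) => concat(a, lit(" and "), b))

// Wrap each row in parentheses, collect, and join with " or ".
val condition = df
  .select(concat(lit("("), perRow, lit(")")).as("gencond"))
  .agg(collect_list(col("gencond")).as("list"))
  .select(concat_ws(" or ", col("list")))
  .first()
  .getString(0)
// e.g. (id=101 and name=Jack) or (id=103 and name=wright) for the id/name DataFrame above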



Answer 1 (score: 0)

Calling collect pulls the result back to the driver, so if you have a huge DataFrame you may run out of memory.

If you are sure you are only dealing with a small number of rows, that is not a problem.

You can do something like the following:

df.map(row => s"(Date=${row.getString(0)} and Key=${row.getString(1)})").collect.mkString("Condition: ", " or ", "")

Output:

res2: String = Condition: (Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50)
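If the columns are not known in advance, the same map can be driven by each row's schema instead of fixed positions. A sketch; it assumes spark.implicits._ is in scope for the String encoder, as it is in spark-shell:

// Build "(col1=v1 and col2=v2 ...)" from whatever fields each Row carries.
df.map(row => row.schema.fieldNames
    .map(f => s"$f=${row.getAs[Any](f)}")
    .mkString("(", " and ", ")"))
  .collect
  .mkString("Condition: ", " or ", "")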

Answer 2 (score: 0)

Using a udf, you can handle a variable number of columns as follows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}

// Columns to include; this list can also be built dynamically, e.g. from df.columns
val list = List("Date", "Key")

// Builds "(col1=v1 and col2=v2 ...)" from a single Row
def getCondString(row: Row): String =
  "(" + list.map(cl => cl + "=" + row.getAs[String](cl)).mkString(" and ") + ")"

val getCondStringUDF = udf(getCondString _)

// Pack all columns into a struct, apply the UDF per row, then join the rows with " or "
df.withColumn("row", getCondStringUDF(struct(df.columns.map(df.col(_)): _*)))
  .select("row").rdd.map(_(0).toString()).collect().mkString(" or ")
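For the sample df in the question, this should produce a string of the same form:

(Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50)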