I am trying to build a string by concatenating the values in a DataFrame. For example:
val df = Seq(
("20181001","10"),
("20181002","40"),
("20181003","50")).toDF("Date","Key")
df.show
The output of df.show is as follows:
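+--------+---+
|    Date|Key|
+--------+---+
|20181001| 10|
|20181002| 40|
|20181003| 50|
+--------+---+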
Here I want to build a condition from the values in the DataFrame, for example: (Date = 20181001 and Key = 10) or (Date = 20181002 and Key = 40) or (Date = 20181003 and Key = 50), and so on. The generated condition will be used as input to another process. The columns in the DataFrame can be dynamic.
I am trying the snippet below. It builds the string as required, but it is static. I am also not sure how it will perform when I have to generate a condition over more than 10 columns. Any suggestions would be highly appreciated.
val df = Seq(
("20181001","10"),
("20181002","40"),
("20181003","50")).toDF("Date","Key")
val colList = df.columns
var cond1 = ""
var finalCond = ""
for (row <- df.rdd.collect) {
  cond1 = "("
  var pk = row.mkString(",").split(",")(0)
  cond1 = cond1 + colList(0) + "=" + pk
  var ak = row.mkString(",").split(",")(1)
  cond1 = cond1 + " and " + colList(1) + "=" + ak + ")"
  finalCond = finalCond + cond1 + " or "
  cond1 = ""
}
print("Condition:" + finalCond.dropRight(3))
Answer 0 (score: 2)
Check this DataFrame-based solution.
scala> val df = Seq(
| ("20181001","10"),
| ("20181002","40"),
| ("20181003","50")).toDF("Date","Key")
df: org.apache.spark.sql.DataFrame = [Date: string, Key: string]
scala> val df2 = df.withColumn("gencond",concat(lit("(Date="), 'Date, lit(" and Key=") ,'Key,lit(")")))
df2: org.apache.spark.sql.DataFrame = [Date: string, Key: string ... 1 more field]
scala> df2.agg(collect_list('gencond)).show(false)
+------------------------------------------------------------------------------------+
|collect_list(gencond) |
+------------------------------------------------------------------------------------+
|[(Date=20181001 and Key=10), (Date=20181002 and Key=40), (Date=20181003 and Key=50)]|
+------------------------------------------------------------------------------------+
EDIT1
If you are reading the columns from parquet files, you can rename them to match this solution and, in the last step, substitute the original parquet column names back in. Check this out.
scala> val df = Seq(("101","Jack"),("103","wright")).toDF("id","name") // Original names from parquet
df: org.apache.spark.sql.DataFrame = [id: string, name: string]
scala> val df2= df.select("*").toDF("Date","Key") // replace it with Date/Key as we used in this question
df2: org.apache.spark.sql.DataFrame = [Date: string, Key: string]
scala> val df3 = df2.withColumn("gencond",concat(lit("(Date="), 'Date, lit(" and Key=") ,'Key,lit(")")))
df3: org.apache.spark.sql.DataFrame = [Date: string, Key: string ... 1 more field]
scala> val df4=df3.agg(collect_list('gencond).as("list"))
df4: org.apache.spark.sql.DataFrame = [list: array<string>]
scala> df4.select(concat_ws(" or ",'list)).show(false)
+----------------------------------------------------+
|concat_ws( or , list) |
+----------------------------------------------------+
|(Date=101 and Key=Jack) or (Date=103 and Key=wright)|
+----------------------------------------------------+
scala> val a = df.columns(0)
a: String = id
scala> val b = df.columns(1)
b: String = name
scala> df4.select(concat_ws(" or ",'list).as("new1")).select(regexp_replace('new1,"Date",a).as("colx")).select(regexp_replace('colx,"Key",b).as("colxy")).show(false)
+--------------------------------------------------+
|colxy |
+--------------------------------------------------+
|(id=101 and name=Jack) or (id=103 and name=wright)|
+--------------------------------------------------+
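Since the question says the columns can be dynamic, the same concat approach can be generalized by building the argument list from df.columns instead of hardcoding Date and Key. A minimal sketch, assuming every column should take part in the condition:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, collect_list, concat, concat_ws, lit}

// Interleave literal fragments with column values:
// "(c1=", v1, " and c2=", v2, ..., ")"
val parts: Seq[Column] = df.columns.zipWithIndex.flatMap { case (c, i) =>
  val sep = if (i == 0) "(" else " and "
  Seq(lit(sep + c + "="), col(c))
}.toSeq :+ lit(")")

df.withColumn("gencond", concat(parts: _*))
  .agg(collect_list(col("gencond")).as("list"))
  .select(concat_ws(" or ", col("list")))
  .show(false)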
Answer 1 (score: 0)
Calling collect pulls the results back to the driver, so if you have a huge DataFrame you may run out of memory.
If you are sure you are only dealing with a small number of rows, that is not a problem.
You could do something like this:
df.map(row => s"(Date=${row.getString(0)} and Key=${row.getString(1)})").collect.mkString("Condition: ", " or ", "")
Output:
res2: String = Condition: (Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50)
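If you would rather not pull everything to the driver at once, a variation on the same idea is to stream the rows with toLocalIterator, which fetches one partition at a time (the final string still has to fit in driver memory). A sketch, assuming spark.implicits._ is in scope as in spark-shell:

import scala.collection.JavaConverters._

val cond = df
  .map(row => s"(Date=${row.getString(0)} and Key=${row.getString(1)})")
  .toLocalIterator.asScala
  .mkString("Condition: ", " or ", "")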
Answer 2 (score: 0)
Using a udf, you can handle a variable number of columns as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
// columns to include in the condition; for fully dynamic input this could be df.columns.toList
val list = List("Date", "Key")
// builds "(col1=v1 and col2=v2 and ...)" for a single row
def getCondString(row: Row): String = {
  "(" + list.map(cl => cl + "=" + row.getAs[String](cl)).mkString(" and ") + ")"
}
val getCondStringUDF = udf(getCondString _)
df.withColumn("row", getCondStringUDF(struct(df.columns.map(df.col(_)): _*)))
  .select("row").rdd.map(_(0).toString()).collect().mkString(" or ")