I have data like the following:
tab1,c1|c2|c3
tab2,d1|d2|d3|d4|d5
tab3,e1|e2|e3|e4
I need to convert it into the following form:
select c1,c2,c3 from tab1;
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
So far I am only able to get this:
d.foreach(f=>{println("select"+" "+f+" from"+";")})
select tab3,e1,e2,e3,e4 from;
select tab1,c1,c2,c3 from;
select tab2,d1,d2,d3,d4,d5 from;
Can anyone suggest how to do this?
Answer 0 (score: 1)
I don't quite see how Spark fits your problem here. What does the variable "d" represent? Still, I think this might help:
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Table name as a plain string, column names as an array of strings
mySchema = StructType([
    StructField("table_name", StringType()),
    StructField("column_name", ArrayType(StringType()))
])

df = spark.createDataFrame([
    ("tab1", ["c1", "c2", "c3"]),
    ("tab2", ["d1", "d2", "d3", "d4", "d5"]),
    ("tab3", ["e1", "e2", "e3", "e4"])
], schema=mySchema)

# Join the column names with "," and wrap the result in a "select ... from ...;" statement
df.selectExpr('concat("select ", concat_ws(",", column_name), " from ", table_name, ";") as select_string').show(3, False)
Output:
+--------------------------------+
|select_string |
+--------------------------------+
|select c1,c2,c3 from tab1; |
|select d1,d2,d3,d4,d5 from tab2;|
|select e1,e2,e3,e4 from tab3; |
+--------------------------------+
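If you need these statements as plain Python strings rather than a displayed table (for example to run them later), the same single-column DataFrame can be collected to the driver with .collect(), which returns Row objects whose select_string field holds each statement.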
Answer 1 (score: 1)
You can also use a map operation on the RDD. Assuming the RDD holds Strings, for example:
val rdd = spark.sparkContext.parallelize(Seq("tab1,c1|c2|c3", "tab2,d1|d2|d3|d4|d5", "tab3,e1|e2|e3|e4"))
then do the following:
val select = rdd.map(str => {
  // "tab1,c1|c2|c3" -> table name before the comma, pipe-separated column names after it
  val separated = str.split(",", -1)
  val table = separated(0)
  val cols = separated(1).split("\\|", -1).mkString(",")
  "select " + cols + " from " + table + ";"
})
and you will get the expected result:
select.foreach(println(_))
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
select c1,c2,c3 from tab1;
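Note that the order of the printed lines is not guaranteed, because foreach runs on the RDD's partitions rather than on the driver (and on a real cluster the println output lands on the executors). A minimal sketch, assuming the same rdd and select values as above, that brings the statements back to the driver in their input order:

// Collect the generated statements to the driver; for an RDD built with parallelize,
// collect() returns the elements in their original order
val statements: Array[String] = select.collect()
statements.foreach(println)

// Hypothetical follow-up: if the tables are actually registered in the session,
// each generated statement could then be executed, e.g.
// statements.foreach(stmt => spark.sql(stmt.stripSuffix(";")).show())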