Suppose I have a Spark DataFrame containing categorical columns (School, Type, Group):
------------------------------------------------------------
StudentID | School | Type | Group
------------------------------------------------------------
1 | ABC | Elementary | Music-Arts
2 | ABC | Elementary | Football
3 | DEF | Secondary | Basketball-Cricket
4 | DEF | Secondary | Cricket
------------------------------------------------------------
I need to add a column to the DataFrame, like this:
--------------------------------------------------------------------------------------
StudentID | School | Type | Group | Combined Array
---------------------------------------------------------------------------------------
1 | ABC | Elementary | Music-Arts | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2 | ABC | Elementary | Football | ["School: ABC", "Type: Elementary", "Group: Football"]
3 | DEF | Secondary | Basketball-Cricket | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4 | DEF | Secondary | Cricket | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------
The additional column is a combination of all the categorical columns, but with special handling for the "Group" column: its values need to be split on "-".
All the categorical columns, including "Group", are supplied in a list, and the "Group" column is also passed in as a string naming the column to be split. The DataFrame also has other, unused columns.
I'm looking for the best-performing solution.
If it were a simple array, this could be done with a single withColumn transformation:
import org.apache.spark.sql.functions.array

val columns = List("School", "Type", "Group")
val df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)): _*))
However, because of the extra processing required for the "Group" column, the solution does not appear to be that straightforward.
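For illustration, a single-withColumn variant that also splits the "Group" column might look like the following sketch. It assumes Spark 2.4+ (for transform in SQL expressions and concat over arrays) and hard-codes which columns get split:

import org.apache.spark.sql.functions._

// Sketch only: assumes Spark 2.4+ for transform (via expr) and concat on arrays.
val plainCols = List("School", "Type")  // used as-is
val splitCols = List("Group")           // split on "-"

// "School: ABC", "Type: Elementary", ...
val plainPart = array(plainCols.map(c => concat(lit(s"$c: "), col(c))): _*)
// "Music-Arts" becomes ["Group: Music", "Group: Arts"]
val splitParts = splitCols.map(c => expr(s"transform(split($c, '-'), x -> concat('$c: ', x))"))

val df2 = df1.withColumn("CombinedArray", concat(plainPart +: splitParts: _*))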
Answer 0 (score: 1)
Using spark.sql(), check the following:
Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
.toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)
Output:
+------+----------+------------------+------------------------------------------------------+
|school|type |group |combined_array |
+------+----------+------------------+------------------------------------------------------+
|ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
|ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
|DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
+------+----------+------------------+------------------------------------------------------+
If you need to use it as a DataFrame:
val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()
root
|-- school: string (nullable = true)
|-- type: string (nullable = true)
|-- group: string (nullable = true)
|-- combined_array: array (nullable = false)
| |-- element: string (containsNull = true)
Update:
Construct the SQL columns dynamically:
scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]
scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group
scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "
scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "
scala> df.createOrReplaceTempView("taba")
scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type |Group |combined_array |
+------+----------+------------------+---------------------------------------------------+
|ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
|ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
|DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
+------+----------+------------------+---------------------------------------------------+
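Note that the dynamically built concat above drops the ":" separator (e.g. SchoolABC) and leaves the Group value unsplit. A possible refinement of the column builder, as a sketch assuming Spark 2.4+ (transform in SQL) and that only "Group" needs splitting:

val arr = df.columns.map {
  case "Group" => "transform(split(Group, '-'), x -> concat('Group:', x))"
  case c       => s"array(concat('$c:', $c))"
}.mkString("concat(", ",", ") as combined_array")  // concat merges the per-column arrays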
Update2:
scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]
scala> df.createOrReplaceTempView("student")
scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]
scala> df3.createOrReplaceTempView("student2")
scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
|2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
|3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
|4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
+---------+------+------------------+----------+---------------------------------------------------------------------------+
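One caveat: concat_ws joins the split Group entries into a single string, so "Group:Music,Group:Arts" above is one array element rather than two. To keep each entry as its own element, a sketch using concat over arrays (assumes Spark 2.4+):

spark.sql(""" select studentid, school, group, type,
       concat(array(concat('School:', school), concat('Type:', type)), temp_arr) as combined_array
  from (select studentid, school, group, type,
               case when sp2 is null then array(concat('Group:', group)) else sp2 end as temp_arr
          from student2) t """).show(false)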
Answer 1 (score: 0)
You need to add an empty column first and then map over the rows, like this in Java:
// Assumes: import org.apache.spark.sql.catalyst.encoders.RowEncoder;
StructType newSchema = df1.schema().add("Combined Array", DataTypes.createArrayType(DataTypes.StringType));
df1 = df1.withColumn("Combined Array", lit(null))
    .map((MapFunction<Row, Row>) row ->
            RowFactory.create(...values...) // add the existing values and the new value here
        , RowEncoder.apply(newSchema));     // map needs an Encoder<Row>, not the bare schema
It should be very similar in Scala.
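A rough Scala equivalent, as a sketch only (buildCombined is a hypothetical helper that derives the combined entries from the row's existing fields):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{ArrayType, StringType}

// Hypothetical helper: build the combined entries from the row's columns.
def buildCombined(row: Row): Seq[String] = ???

val newSchema = df1.schema.add("Combined Array", ArrayType(StringType))
// Append the new array value to the existing row values;
// map needs an explicit Row encoder for the widened schema.
val df2 = df1.map(row => Row.fromSeq(row.toSeq :+ buildCombined(row)))(RowEncoder(newSchema))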
Answer 2 (score: 0)
Use regex replacement to prefix each field at its start and to handle the "-" in between:
import org.apache.spark.sql.functions.{array, regexp_replace}

val df1 = spark.read.option("header", "true").csv(filePath)
val columns = List("School", "Type", "Group")
val df2 = df1.withColumn("CombinedArray", array(columns.map {
  colName => regexp_replace(regexp_replace(df1(colName), "^", s"$colName: "), "-", s", $colName: ")
}: _*))
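Note that this still yields one array element per column, e.g. "Group: Music, Group: Arts" as a single string. A sketch that splits these into individual elements (assumes Spark 2.4+ for flatten; uses "|" as a temporary delimiter):

import org.apache.spark.sql.functions.{array, flatten, regexp_replace, split}

val df3 = df1.withColumn("CombinedArray", flatten(array(columns.map { colName =>
  // "Music-Arts" -> "Group: Music|Group: Arts" -> ["Group: Music", "Group: Arts"]
  split(regexp_replace(regexp_replace(df1(colName), "^", s"$colName: "), "-", s"|$colName: "), "\\|")
}: _*)))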