I am trying to bring an RDBMS table into Hive. I have obtained a dataframe with the following schema:
geography: string
project: string
reference_code: string
product_line: string
book_type: string
cc_region: string
cc_channel: string
cc_function: string
pl_market: string
ptd_balance: double
qtd_balance: double
ytd_balance: double
xx_last_update_tms: timestamp
xx_last_update_log_id: int
xx_data_hash_code: string
xx_data_hash_id: bigint
The columns ptd_balance, qtd_balance and ytd_balance are of double datatype; they are precision columns. To avoid any data truncation, our project wants to convert their datatype from Double to String by creating new columns ptd_balance_text, qtd_balance_text and ytd_balance_text holding the same data. withColumn creates a new column in the dataframe, while withColumnRenamed renames an existing column.
The dataframe has close to 10 million records. Is there an efficient way to create multiple new columns that carry the same data as existing columns but with a different type?
Answer 0 (score: 1)
You can build the query from all the columns like below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType
//Input:
scala> df.show
+----+-----+--------+--------+
| id| name| salary| bonus|
+----+-----+--------+--------+
|1001|Alice| 8000.25|1233.385|
|1002| Bob|7526.365| 1856.69|
+----+-----+--------+--------+
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- salary: double (nullable = false)
|-- bonus: double (nullable = false)
//solution approach:
val query = df.columns.toList.map(cl => if (cl == "salary" || cl == "bonus") col(cl).cast(StringType).as(cl + "_text") else col(cl))
//Output:
scala> df.select(query:_*).printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- salary_text: string (nullable = false)
|-- bonus_text: string (nullable = false)
scala> df.select(query:_*).show
+----+-----+-----------+----------+
| id| name|salary_text|bonus_text|
+----+-----+-----------+----------+
|1001|Alice| 8000.25| 1233.385|
|1002| Bob| 7526.365| 1856.69|
+----+-----+-----------+----------+
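Applied to the columns from the question, the same idea can also be written as a foldLeft over withColumn. A minimal sketch, assuming df is the dataframe read from the RDBMS with the column names listed in the question:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// the three precision columns named in the question
val balanceCols = Seq("ptd_balance", "qtd_balance", "ytd_balance")

// add a *_text string copy of each balance column;
// each step returns a new dataframe (dataframes are immutable)
val dfWithText = balanceCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(c + "_text", col(c).cast(StringType))
}
```

Each withColumn adds a projection to the logical plan, but Catalyst should collapse the chained projections, so for a handful of columns this performs comparably to the single select above.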
Answer 1 (score: 1)
If it were me, I would either make the change in the extraction query, or ask the BI team to put in some effort :P so the fields are added and cast on the fly during extraction. But anyway, what you are asking is possible.
You can add columns from existing columns as shown below. Check the addColsTosampleDF dataframe. I hope the comments below are enough to understand it; if you have any questions, feel free to add them and I will edit my answer.
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
scala> val ss = SparkSession.builder().appName("TEST").getOrCreate()
18/08/07 15:51:42 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
ss: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6de4071b
//Sample dataframe with int, double and string fields
scala> val sampleDf = Seq((100, 1.0, "row1"),(1,10.12,"col_float")).toDF("col1", "col2", "col3")
sampleDf: org.apache.spark.sql.DataFrame = [col1: int, col2: double ... 1 more field]
scala> sampleDf.printSchema
root
|-- col1: integer (nullable = false)
|-- col2: double (nullable = false)
|-- col3: string (nullable = true)
//Adding columns col1_string from col1 and col2_doubletostring from col2 with casting and alias
scala> val addColsTosampleDF = sampleDf.
select(sampleDf.col("col1"),
sampleDf.col("col2"),
sampleDf.col("col3"),
sampleDf.col("col1").cast("string").alias("col1_string"),
sampleDf.col("col2").cast("string").alias("col2_doubletostring"))
addColsTosampleDF: org.apache.spark.sql.DataFrame = [col1: int, col2: double ... 3 more fields]
//Schema with added columns
scala> addColsTosampleDF.printSchema
root
|-- col1: integer (nullable = false)
|-- col2: double (nullable = false)
|-- col3: string (nullable = true)
|-- col1_string: string (nullable = false)
|-- col2_doubletostring: string (nullable = false)
scala> addColsTosampleDF.show()
+----+-----+---------+-----------+-------------------+
|col1| col2| col3|col1_string|col2_doubletostring|
+----+-----+---------+-----------+-------------------+
| 100| 1.0| row1| 100| 1.0|
| 1|10.12|col_float| 1| 10.12|
+----+-----+---------+-----------+-------------------+
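If the original double columns are not needed once the string copies exist, they can be dropped and the new columns renamed back. A small sketch, assuming the addColsTosampleDF dataframe from above:

```scala
// keep only the string versions, then restore the original column names
val finalDf = addColsTosampleDF.
  drop("col1", "col2").
  withColumnRenamed("col1_string", "col1").
  withColumnRenamed("col2_doubletostring", "col2")
```

This uses the withColumnRenamed call mentioned in the question: it renames an existing column rather than creating a new one.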