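I want to summarize a dataframe. I have several outputs, and I want to merge the three dataframes into a single dataframe laid out exactly like the first one.
This is what I did.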
// Compute column summary statistics.
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.option("header", true).option("inferSchema", true).format("com.databricks.spark.csv").load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")
val colNames=dataframe.columns
val data=dataframe.describe().show()
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
|summary| Col0| Col1| Col2| Col3| Col4|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
| count| 9999| 9999| 9999| 9999| 9999|
| mean| 0.4976937166129511| 0.5032998128645433| 0.5002933978916888| 0.5008783202471074|0.49977372871783293|
| stddev| 0.2893201326892155|0.28767789122296994|0.29041197844235034|0.28989958496291496| 0.2881033430504947|
| min|4.92436811557243E-6|3.20277176946531E-5|1.41602940923349E-5|6.53252937203857E-5| 5.4864212896146E-5|
| max| 0.999442967120299| 0.9999608020298| 0.999968873336897| 0.999836584087385| 0.999822016805327|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
println("Skewness")
val Skewness = dataframe.columns.map(c => skewness(c).as(c))
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).show()
Skewness
+--------------------+--------------------+--------------------+--------------------+--------------------+
| Col0| Col1| Col2| Col3| Col4|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|0.015599787007160271|-0.00740111491496...|0.006096695102089171|0.003614431405637598|0.007869663345343194|
+--------------------+--------------------+--------------------+--------------------+--------------------+
println("Kurtosis")
val Kurtosis = dataframe.columns.map(c => kurtosis(c).as(c))
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).show//kurtosis
Kurtosis
+-------------------+-------------------+-------------------+-------------------+------------------+
| Col0| Col1| Col2| Col3| Col4|
+-------------------+-------------------+-------------------+-------------------+------------------+
|-1.2187774053075133|-1.1861812968784207|-1.2107252263053805|-1.2108988817869097|-1.199054929668751|
+-------------------+-------------------+-------------------+-------------------+------------------+
I want to append the skewness and kurtosis dataframes to the first one and put their names in its first column.
Thanks in advance.
Answer 0 (score: 0)
You need to add a summary column to the skewness and kurtosis tables with withColumn:
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).withColumn("summary", lit("Skewness"))
Do the same for kurtosis:
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).withColumn("summary", lit("Kurtosis"))
Use select on all the dataframes so that their columns are in the same order:
val orderColumn = Vector("summary", "Col0", "Col1", "Col2", "Col3", "Col4")
val Skewness_ordered = Skewness_.select(orderColumn.map(col):_*)
val Kurtosis_ordered = Kurtosis_.select(orderColumn.map(col):_*)
Then union them:
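// Start from describe(): the raw dataframe has no summary column, so its column count would not match.
val combined = dataframe.describe().union(Skewness_ordered).union(Kurtosis_ordered)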
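The steps in this answer can be put together as a minimal spark-shell style sketch. The in-memory data, the two-column layout, and the helper name labelled are assumptions made for illustration and are not part of the original post. The aggregates are cast to string here because describe() returns string columns; that cast is a choice made in this sketch, not something the answer requires.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, kurtosis, lit, skewness}

val spark = SparkSession.builder.master("local[1]").appName("summary-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV loaded in the question.
val dataframe = Seq((0.1, 0.9), (0.4, 0.5), (0.8, 0.2), (0.6, 0.3)).toDF("Col0", "Col1")

// describe() yields the columns summary, Col0, Col1, all typed as string.
val stats = dataframe.describe()
val orderColumn = stats.columns.map(col)

// Hypothetical helper: builds a one-row aggregate, tags it with a summary label,
// and reorders the columns to match describe().
def labelled(name: String, f: String => Column): DataFrame = {
  val aggs = dataframe.columns.map(c => f(c).cast("string").as(c))
  dataframe.agg(aggs.head, aggs.tail: _*)
    .withColumn("summary", lit(name))
    .select(orderColumn: _*)
}

val combined = stats
  .union(labelled("Skewness", (c: String) => skewness(c)))
  .union(labelled("Kurtosis", (c: String) => kurtosis(c)))

combined.show()
```

The result should hold the five describe() rows followed by one row each for Skewness and Kurtosis, with summary as the first column.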
Answer 1 (score: 0)
You can neatly merge the Skewness and Kurtosis dataframes with the initial dataframe into a new dataframe:
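import org.apache.spark.sql.functions._
// Assumes Skewness and Kurtosis are the aggregated DataFrames, i.e. the result of dataframe.agg(...) without .show(),
// and that the union starts from the describe() output rather than the raw data.
val result = dataframe.describe().union(Skewness.select(lit("Skewness"), Skewness.col("*")))
  .union(Kurtosis.select(lit("Kurtosis"), Kurtosis.col("*")))
result.show()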
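Note that union matches columns by position, so the unaliased literal column simply takes the name of the first dataframe's first column. If you would rather match by name, one possible variant is to alias the literal as summary and use unionByName, which I believe requires Spark 2.3 or later. The sketch below assumes small in-memory data and the names skewDf and kurtDf, none of which appear in the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, kurtosis, lit, skewness}

val spark = SparkSession.builder.master("local[1]").appName("union-by-name-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the CSV loaded in the question.
val dataframe = Seq((0.1, 0.9), (0.4, 0.5), (0.8, 0.2), (0.6, 0.3)).toDF("Col0", "Col1")

// Cast to string so the aggregate rows have the same column types as describe().
val skewCols = dataframe.columns.map(c => skewness(c).cast("string").as(c))
val kurtCols = dataframe.columns.map(c => kurtosis(c).cast("string").as(c))

// One-row DataFrames holding the aggregates (no .show(), which would return Unit).
val skewDf = dataframe.agg(skewCols.head, skewCols.tail: _*)
val kurtDf = dataframe.agg(kurtCols.head, kurtCols.tail: _*)

// Alias the label column as "summary" so that name-based matching succeeds.
val result = dataframe.describe()
  .unionByName(skewDf.select(lit("Skewness").as("summary"), col("*")))
  .unionByName(kurtDf.select(lit("Kurtosis").as("summary"), col("*")))

result.show()
```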