我有一个名为spark-dataframe
的输入df
,
+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
| 725153| 1| 0| 2| 3|
| 873008| 0| 0| 3| 3|
| 625109| 1| 1| 0| 2|
+---------------+---+---+---+-----------+
在这里,Total_Count
是P1,P2,P3
的总和,而P1,P2,P3
是product names
的总和。我需要通过将产品的值除以其frequency
来找到每个产品的Total_Count
。我需要创建一个名为spark-dataframe
的新frequencyTable
,如下所示
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
| 725153|0.3333333333333333|0.0|0.6666666666666666| 3|
| 873008| 0.0|0.0| 1.0| 3|
| 625109| 0.5|0.5| 0.0| 2|
+---------------+------------------+---+------------------+-----------+
我已经使用Scala作为
val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
if (index != "Main_CustomerID" && index != "Total_Count") {
frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
}
}
但是我不喜欢此for循环,因为我的df
的尺寸较大。什么是优化的解决方案?
答案 0 :(得分:1)
如果数据框为
val df = Seq(
("725153", 1, 0, 2, 3),
("873008", 0, 0, 3, 3),
("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")
+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153 |1 |0 |2 |3 |
|873008 |0 |0 |3 |3 |
|625109 |1 |1 |0 |2 |
+---------------+---+---+---+-----------+
您可以简单地在foldLeft
,Main_CustomerID
以外的列上使用Total_Count
,即在P1
P2
和P3
上使用
val df_columns = df.columns.toSet - "Main_CustomerID" - "Total_Count" toList
df_columns.foldLeft(df){(tempdf, colName) => tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))}.show(false)
应该给您
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153 |0.3333333333333333|0.0|0.6666666666666666|3 |
|873008 |0.0 |0.0|1.0 |3 |
|625109 |0.5 |0.5|0.0 |2 |
+---------------+------------------+---+------------------+-----------+
我希望答案会有所帮助