Question

I have a Spark DataFrame that looks like this:

df.show()
+------+------+------+
|  col1|  col2|  col3|
+------+------+------+
|   5.0|   5.0|   0.0|
|   2.0|   3.0|   5.0|
|   4.0|   1.0|  10.0|
+------+------+------+

I want to normalize each row, so that after the operation the new columns look like this:

+--------+--------+--------+
|new_col1|new_col2|new_col3|
+--------+--------+--------+
|     0.5|     0.5|     0.0|
|     0.2|     0.3|     0.5|
|0.266667|0.066667|0.666667|
+--------+--------+--------+

More formally, the operation I want to apply to each row is:

    new_col_i = col_i / (col_1 + col_2 + col_3)
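As a quick sanity check of the formula, here is a minimal plain-Python sketch (no Spark involved) that applies it to the three sample rows. The helper name `normalize_row` is made up for illustration, and it assumes every row has a non-zero sum.

```python
# Sample rows copied from the DataFrame shown in the question.
rows = [
    (5.0, 5.0, 0.0),
    (2.0, 3.0, 5.0),
    (4.0, 1.0, 10.0),
]


def normalize_row(row):
    """Divide every entry by the row sum (assumes the sum is non-zero)."""
    total = sum(row)
    return tuple(value / total for value in row)


normalized = [normalize_row(r) for r in rows]

# Print rounded to six decimals, like the desired output table.
for r in normalized:
    print(tuple(round(v, 6) for v in r))
```

Each normalized row sums to 1, which is a handy invariant to check on the real DataFrame as well.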

I need to do this programmatically rather than listing every column by hand, because my DataFrame has a lot of columns.

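Current solution:

The solution I have come up with so far is to create a column holding the sum of all entries in each row, and then divide every column by that sum column.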

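import org.apache.spark.sql.functions.col

// Row-wise sum of all columns, stored in a new "total" column
var newDF = df.withColumn("total", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))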

for (c <- Array("col1", "col2", "col3")) {
    newDF = newDF.withColumn("normalized_" + c, col(c).divide(col("total")))
}
newDF.show()

+----+----+----+-----+-------------------+-------------------+------------------+
|col1|col2|col3|total|    normalized_col1|    normalized_col2|   normalized_col3|
+----+----+----+-----+-------------------+-------------------+------------------+
| 5.0| 5.0| 0.0| 10.0|                0.5|                0.5|               0.0|
| 2.0| 3.0| 5.0| 10.0|                0.2|                0.3|               0.5|
| 4.0| 1.0|10.0| 15.0|0.26666666666666666|0.06666666666666667|0.6666666666666666|
+----+----+----+-----+-------------------+-------------------+------------------+

Is there any alternative approach that would make the code more concise?

Answer 1

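Your solution is correct and cannot be improved much. You can get rid of the unidiomatic var by replacing the for loop with foldLeft, and use a bit more syntactic sugar, but other than that it stays the same: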

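import org.apache.spark.sql.functions.col
import spark.implicits._  // assumes a SparkSession named `spark`; enables the $"..." syntax

val withTotal = df.withColumn("total", df.columns.map(col).reduce(_ + _))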

val result = df.columns.foldLeft(withTotal) {
  case (tmp, c) => tmp.withColumn(s"new_$c", $"$c" / $"total")
}
  .drop(df.columns: _*)
  .drop("total")

Answer 2

For anyone who wants to do row normalization in PySpark, the code below worked for me:

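from pyspark.sql.types import StructType, StructField, DoubleType

# Append a 'total' column holding the row-wise sum of all columns
new_df = df.withColumn('total', sum(df[col] for col in df.columns))
my_schema = StructType([StructField(col, DoubleType(), True) for col in df.columns])
# 'total' is the last field; the factor 100.00 turns the fractions into percentages
result  = new_df.rdd.map(lambda x: [100.00 * x[i]/x[len(x) -1] for i in range(len(x)-1)]).toDF(schema = my_schema)
result.show()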

+------------------+-----------------+-----------------+
|              col1|             col2|             col3|
+------------------+-----------------+-----------------+
|              50.0|             50.0|              0.0|
|              20.0|             30.0|             50.0|
|26.666666666666668|6.666666666666667|66.66666666666667|
+------------------+-----------------+-----------------+
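Note that this output is in percentages, not the fractions asked for in the question, because of the 100.00 factor inside the lambda. The plain-Python sketch below (no Spark needed) runs the same expression on rows that already carry the total in the last position. The function name `to_shares` and its `scale` parameter are hypothetical, added only to show how dropping the factor recovers the fractions.

```python
# Rows as they would look after the 'total' column is appended (last position).
rows_with_total = [
    (5.0, 5.0, 0.0, 10.0),
    (2.0, 3.0, 5.0, 10.0),
    (4.0, 1.0, 10.0, 15.0),
]


def to_shares(x, scale=100.00):
    """Same expression as the lambda above, with the factor made configurable.

    Relies on the total being the last element of the row.
    """
    return [scale * x[i] / x[len(x) - 1] for i in range(len(x) - 1)]


percentages = [to_shares(r) for r in rows_with_total]         # as in Answer 2
fractions = [to_shares(r, scale=1.0) for r in rows_with_total]  # as in the question

print(percentages)
print(fractions)
```

The sketch also makes the positional assumption explicit: if 'total' were not the last column, the lambda would divide by the wrong value.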

Spark: normalize each row of a DataFrame
