Adding multiple columns from a list into one column

Question

I have a list of column names that I want to add together:

columns = ['col1','col2','col3']

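How can I add these three columns and put the result in a new column? (It should be automatic, so that I can change the column list and get a new result.)

The DataFrame I want as a result: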

col1   col2   col3   result
 1      2      3       6

Thanks!

Answer 1

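I tried many approaches, and here are my observations:

PySpark's sum function (pyspark.sql.functions.sum) aggregates over the rows of a single column, so it does not add columns together (PySpark version 2.3.1).
Python's built-in sum function works for some people but raises an error for others.

So, adding multiple columns can be done with PySpark's expr function, which takes the expression to evaluate as a string.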

from pyspark.sql.functions import expr

cols_list = ['a', 'b', 'c']

# Creating an addition expression using `join`
expression = '+'.join(cols_list)

df = df.withColumn('sum_cols', expr(expression))

This gives us the desired sum of the columns. The same approach works with any other, more complex expression to produce other outputs.
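One caveat with the joined string: it is parsed as a Spark SQL expression, so a column name containing spaces, dots, or operators would break it. Below is a minimal sketch of a safer builder, assuming backtick quoting of identifiers. The helper name `build_sum_expression` is hypothetical and not part of PySpark.

```python
def build_sum_expression(cols):
    """Return a Spark SQL expression string that adds the given columns."""
    if not cols:
        raise ValueError("need at least one column name")
    # Backticks quote identifiers in Spark SQL; a literal backtick inside
    # a name is escaped by doubling it (assumption: default parser settings).
    quoted = ["`" + name.replace("`", "``") + "`" for name in cols]
    return " + ".join(quoted)


expression = build_sum_expression(["a", "b", "my col"])
print(expression)  # `a` + `b` + `my col`
```

The resulting string can be passed to `expr` in the same way as the plain joined string, for example `df.withColumn('sum_cols', expr(expression))`.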

Answer 2

Try this:

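df = df.withColumn('result', sum(df[col] for col in df.columns))

Here df.columns is the list of all columns in df. To sum only some of them, substitute your own list of names, such as columns from the question.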
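As background on why Python's built-in `sum` can work here: it starts from the integer 0 and applies `+` to each item in turn, so it relies on the items supporting addition with a number on the left. The sketch below uses a hypothetical stand-in class, `FakeColumn` (not the real PySpark `Column`), so that it runs without Spark.

```python
class FakeColumn:
    """Hypothetical stand-in for a column expression; records the formula as text."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return FakeColumn(f"({self.text} + {_text(other)})")

    def __radd__(self, other):
        # Called for `0 + FakeColumn`, which is the first step sum() performs.
        return FakeColumn(f"({_text(other)} + {self.text})")


def _text(value):
    """Render either a FakeColumn or a plain number as text."""
    return value.text if isinstance(value, FakeColumn) else str(value)


columns = ["col1", "col2", "col3"]
result = sum(FakeColumn(name) for name in columns)
print(result.text)  # (((0 + col1) + col2) + col3)
```

A common reason the built-in fails for some people is `from pyspark.sql.functions import *`, which replaces the name `sum` with PySpark's aggregate function of the same name.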

Answer 3

[Edited to explain each step]

If you have a static list of columns, you can do this:

df.withColumn("result", col("col1") + col("col2") + col("col3"))

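However, if you don't want to type out the whole column list, you need to build the expression col("col1") + col("col2") + col("col3") iteratively. For that, you can use reduce with the add function: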

reduce(add, [col(x) for x in df.columns])

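The columns are added two at a time, so you get (col("col1") + col("col2")) + col("col3") rather than a flat col("col1") + col("col2") + col("col3"). The result is the same.

Wrapping each name in col(x) ensures that you build that column expression instead of simply concatenating the strings (which would produce col1col2col3).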
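To make the folding order concrete, here is a small sketch that runs without Spark. `Sym` is a hypothetical stand-in for a column object (not PySpark's `Column`) whose `+` records the expression as text. The second call shows what happens when the names are left as plain strings.

```python
from functools import reduce
from operator import add


class Sym:
    """Hypothetical stand-in for a column; `+` builds a textual expression."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Sym(f"({self.text} + {other.text})")


names = ["col1", "col2", "col3"]

# Wrapping each name first: reduce folds left to right, two operands at a time.
folded = reduce(add, [Sym(n) for n in names])
print(folded.text)  # ((col1 + col2) + col3)

# Without wrapping, `+` on strings is plain concatenation.
print(reduce(add, names))  # col1col2col3
```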

[TL;DR]

Combining the steps above, you can do this:

from functools import reduce
from operator import add
from pyspark.sql.functions import col

df.na.fill(0).withColumn("result" ,reduce(add, [col(x) for x in df.columns]))

The df.na.fill(0) part handles nulls in your data. If you don't have any nulls, you can skip it and do this instead:

df.withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
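Why the fill matters: in Spark SQL, adding a null to a number yields null, so a single missing value would blank out the whole result for that row. The sketch below models that rule in plain Python, with `None` standing in for null. `sql_add` and `row_sum` are hypothetical helpers for illustration, not Spark APIs.

```python
from functools import reduce


def sql_add(a, b):
    """Model of SQL addition: the result is null (None) if either side is null."""
    if a is None or b is None:
        return None
    return a + b


def row_sum(values, fill=None):
    """Sum one row's values, optionally replacing nulls with `fill` first."""
    if fill is not None:
        values = [fill if v is None else v for v in values]
    return reduce(sql_add, values)


row = [1, None, 3]
print(row_sum(row))          # None
print(row_sum(row, fill=0))  # 4
```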
