Question

I have a Spark DataFrame named "df_array" that always returns a single array as output, as shown below.

arr_value
[M,J,K]

I want to extract its value and add it to another DataFrame. Below is the code I am running:

val new_df = old_df.withColumn("new_array_value", df_array.col("UNCP_ORIG_BPR"))

But my code always fails with "org.apache.spark.sql.AnalysisException: resolved attribute(s)".

Can anyone help me?

Answer 1

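The operation you need here is a join.

You need a common column in both DataFrames, which will be used as the "key".

After joining, you can select which columns to include in the new DataFrame.

More details can be found here: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html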

join(other, on=None, how=None)

Joins with another DataFrame, using the given join expression.
Parameters: 

    other – Right side of the join
    on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
    how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, leftsemi.

The following performs a full outer join between df1 and df2.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name=u'Bob', height=85), Row(name=u'Alice', height=None)]
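
For the Scala API used in the question, the same idea could be sketched roughly as follows. The sample data, the column names (`id`, `old_col1`, `arr_value`), the helper names, and the local SparkSession are all hypothetical and made up for illustration. Since the question's df_array holds a single row and shares no key column with old_df, the sketch also shows `crossJoin` (available from Spark 2.1), which pairs that single row with every row of old_df without needing a key; that is an alternative beyond what this answer describes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinSketch {
  // Equi-join on a shared key column; rows of `left` without a match get nulls.
  def joinOnKey(left: DataFrame, right: DataFrame, key: String): DataFrame =
    left.join(right, Seq(key), "left_outer")

  // No shared key: pair the single row of `single` with every row of `left`.
  def attachSingleRow(left: DataFrame, single: DataFrame): DataFrame =
    left.crossJoin(single)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("join-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val old_df = Seq((1, "a"), (2, "b")).toDF("id", "old_col1")
    val keyed = Seq((1, Seq("M", "J", "K"))).toDF("id", "arr_value")
    val df_array = Seq(Tuple1(Seq("M", "J", "K"))).toDF("arr_value")

    // Keyed join, then select the columns to keep.
    joinOnKey(old_df, keyed, "id").select("id", "old_col1", "arr_value").show()

    // Key-less variant for a single-row DataFrame.
    attachSingleRow(old_df, df_array).show()

    spark.stop()
  }
}
```

Note that `crossJoin` multiplies row counts, so it only behaves like "add a constant column" while df_array really has exactly one row.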

Answer 2

If you know df_array has only one record, you can collect it to the driver using first() and then use it as a literal value to create a column in any DataFrame:

import org.apache.spark.sql.functions._
import scala.collection.mutable

// first - collect that single array to driver (assuming array of strings):
val arrValue = df_array.first().getAs[mutable.WrappedArray[String]](0)

// now use lit() function to create a "constant" value column:
val new_df = old_df.withColumn("new_array_value", array(arrValue.map(lit): _*))

new_df.show()
// +--------+--------+---------------+
// |old_col1|old_col2|new_array_value|
// +--------+--------+---------------+
// |       1|       a|      [M, J, K]|
// |       2|       b|      [M, J, K]|
// +--------+--------+---------------+
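
To try this approach end to end, a self-contained sketch might look like the following. The sample data, the helper name `withConstantArray`, and the local SparkSession are assumptions for illustration, not part of the answer. It also swaps `getAs[mutable.WrappedArray[String]]` for `Row.getSeq[String]`, which reads the same array without naming the concrete collection class.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, lit}

object LiteralArraySketch {
  // Hypothetical helper: copies the single array held by `dfArray` into a
  // constant column on `oldDf`. Assumes `dfArray` has at least one row whose
  // first column is an array of strings; first() throws on an empty DataFrame.
  def withConstantArray(oldDf: DataFrame, dfArray: DataFrame, name: String): DataFrame = {
    // Collect the single array to the driver.
    val arrValue: Seq[String] = dfArray.first().getSeq[String](0)
    // Turn each element into a literal column and combine them into one array column.
    oldDf.withColumn(name, array(arrValue.map(v => lit(v)): _*))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("literal-array-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data mirroring the names used in the question.
    val old_df = Seq((1, "a"), (2, "b")).toDF("old_col1", "old_col2")
    val df_array = Seq(Tuple1(Seq("M", "J", "K"))).toDF("arr_value")

    val new_df = withConstantArray(old_df, df_array, "new_array_value")
    new_df.show()

    spark.stop()
  }
}
```

Because the array is read once on the driver, the new column is a constant: later changes to df_array are not reflected in new_df.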

Extract a column value from a Spark DataFrame and add it to another DataFrame
