Question

I defined the following function in PySpark:

from pyspark.sql.types import LongType

def add_ids(X):
    schema_new = X.schema.add("id_col", LongType(), False)
    _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
    cols_arranged = [_X.columns[-1]] + _X.columns[0:len(_X.columns) - 1]
    return _X.select(*cols_arranged)

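In the function above, I create a new column named id_col and append it to the DataFrame. The column is simply the index number of each row, and the function finally moves id_col to the leftmost position.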

The data I am using

>>> X.show(4)
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 4 rows

Output of the function

>>> add_ids(X).show(4)
+------+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|id_col|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+------+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|     0|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|     1|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|     2|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|     3|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
+------+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 4 rows

All of this works fine, but the problem appears when I run the following two commands

>>> X.show(4)
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 4 rows

>>> X.columns
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'id_col']

If you look at the result of X.columns, you will notice id_col at the end. However, when I ran X.show(4) on the line before, it did not show id_col as a column.

Now, when I try to run add_ids(X).show(4), I get an error.

What am I doing wrong?

Answer 1

The mistake is here:

schema_new = X.schema.add("id_col", LongType(), False)

If you check the source, you will see that the add method modifies the StructType in place.

This is easier to see in a simplified example:

from pyspark.sql.types import *

schema = StructType()
schema.add(StructField("foo", IntegerType()))

schema

StructType(List(StructField(foo,IntegerType,true)))

As you can see, the schema object itself has been modified.

Instead of using the add method, you should rebuild the schema:

schema_new = StructType(schema.fields + [StructField("id_col", LongType(), False)])

Alternatively, you can create a deep copy of the object:

import copy

old_schema = StructType()
new_schema = copy.deepcopy(old_schema).add(StructField("foo", IntegerType()))

old_schema

StructType(List())

new_schema

StructType(List(StructField(foo,IntegerType,true)))

Ambiguous behavior when adding a new column to a StructType
