I am trying to use the withColumn function to move a column in a Spark DataFrame from somewhere in the middle to the first position.
Here is my PySpark code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
df_train = spark.createDataFrame([("a", 1, 2), ("a", 1, 2), ("a", 1, 3), ("a", 2, 4), ("b", 3, 5), ("c", 4, 6)], ["C1", "C2", "show_status"])
df_train.show()
columns_without_label = df_train.drop('show_status').columns
print(columns_without_label, type(columns_without_label))
df_train_new = df_train.select('show_status')
for col_name in columns_without_label:
    df_train_new = df_train_new.withColumn(col_name, df_train[col_name])
df_train_new.show()
Here is the error message I get:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o175.withColumn.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) C2#1L missing from show_status#2L in operator !Project [show_status#2L, (C2#1L + cast(2 as bigint)) AS C2#21L].;;
!Project [show_status#2L, (C2#1L + cast(2 as bigint)) AS C2#21L]
+- Project [show_status#2L]
+- LogicalRDD [C1#0, C2#1L, show_status#2L], false