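I have a DataFrame with a few columns, shown below. When I select columns and give one of them an alias that matches an existing column name, no error is raised, but those columns can no longer be used in the new DataFrame.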
# Dates are quoted so they are stored as strings rather than parsed as arithmetic.
df1 = sc.parallelize([
    ("1984-01-01", 1, 638.55),
    ("1984-01-02", 2, 638.55)
]).toDF(["date1", "hour", "value1"])
# df1
# +----------+----+------+
# |     date1|hour|value1|
# +----------+----+------+
# |1984-01-01|   1|638.55|
# |1984-01-02|   2|638.55|
# +----------+----+------+
When I alias a column as shown below, I can no longer retrieve the data from the DataFrame.
from pyspark.sql import functions as sf
new_df = df1.select(sf.col('date1').alias('hour'), sf.col('hour'))
# new_df
# +----------+----+
# |      hour|hour|
# +----------+----+
# |1984-01-01|   1|
# |1984-01-02|   2|
# +----------+----+
When I try to select the "hour" column, it raises an ambiguous-column error:
new_df.select('hour').show()
pyspark.sql.utils.AnalysisException: "Reference 'hour' is ambiguous, could be: hour, hour.;"
How can I access the data now?
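One possible way out, offered as a sketch rather than a definitive answer, is to rename the columns by position so that every name is unique again. The helper `dedupe_names` below is hypothetical (it is not part of PySpark); it only builds the list of new names in plain Python. The commented lines at the end assume that `DataFrame.toDF(*cols)` renames columns positionally, which is what makes it usable when names collide.

```python
from collections import Counter


def dedupe_names(columns):
    """Return the column names with a positional suffix on every duplicated name.

    Hypothetical helper for illustration. It does not check whether a
    generated name such as 'hour_0' already exists among the columns.
    """
    counts = Counter(columns)   # how often each name occurs overall
    seen = Counter()            # how many times each duplicate has been emitted
    result = []
    for name in columns:
        if counts[name] > 1:
            result.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            result.append(name)
    return result


print(dedupe_names(["hour", "hour"]))

# Assumed usage with the DataFrame from the question (requires a Spark session):
# fixed_df = new_df.toDF(*dedupe_names(new_df.columns))
# fixed_df.select("hour_0").show()
```

Renaming by position avoids referring to the ambiguous name at all. If the collision can be prevented earlier, choosing a distinct alias in the original `select` is the simpler fix.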