PySpark column names lose their aliases after a join

Posted: 2019-11-14 18:20:05

Tags: python dataframe pyspark

I have two dataframes. After joining them and printing the joined frame's schema, I get the plain column names instead of the alias-qualified (nested) names.


The two frames are built like this:

df1 = sc.parallelize([
    ("1984-01-01", 1, 638.55),
    ("1984-01-02", 2, 638.55)
]).toDF(["date1", "hour", "value1"])

df2 = sc.parallelize([
    ("1984-01-01", 1, 638.55),
    ("1984-02-01", 2, 638.55)
]).toDF(["date2", "hour", "value2"])

# df1
# +----------+----+------+
# |     date1|hour|value1|
# +----------+----+------+
# |1984-01-01|   1|638.55|
# |1984-01-02|   2|638.55|
# +----------+----+------+

# df2
# +----------+----+------+
# |     date2|hour|value2|
# +----------+----+------+
# |1984-01-01|   1|638.55|
# |1984-02-01|   2|638.55|
# +----------+----+------+

After aliasing and joining the frames, the schema only has the column names, not their respective aliases:

joined_frame = df1.alias('df1').join(df2.alias('df2'), ['hour'])

joined_frame.printSchema()
root
 |-- hour: long (nullable = true)
 |-- date1: string (nullable = true)
 |-- value1: double (nullable = true)
 |-- date2: string (nullable = true)
 |-- value2: double (nullable = true)

Instead, I would like it to print the schema with the aliases preserved, like below:

root
 |-- df1
      |-- date1: string (nullable = true)
      |-- hour: long (nullable = true)
      |-- value1: double (nullable = true)
 |-- df2
      |-- date2: string (nullable = true)
      |-- hour: long (nullable = true)
      |-- value2: double (nullable = true)

Also, when I print the column names, it only gives the unqualified names:

joined_frame.columns
['hour', 'date1', 'value1', 'date2', 'value2']

When I try to access some columns, I get the following error:

org.apache.spark.sql.AnalysisException: cannot resolve '`hour1`' given input columns: [df1.date1, df1.value1, df2.date2, df2.value2, hour]

Basically, how can I get the columns of joined_frame with their aliases?

0 Answers:

No answers yet.