Get the name/alias of a column in PySpark

Asked: 2019-05-02 09:37:52

Tags: apache-spark pyspark

I'm defining a column object like this:

from pyspark.sql import functions as F

column = F.col('foo').alias('bar')

I know I can get the full expression by calling str(column). But is there a way to get only the value of the column's alias?

In the example, I'm looking for a function FN where FN(column) returns bar.
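For reference, a sketch of what the full expression string looks like; the repr is not a stable API and differs across Spark releases (the first form is from Spark 2.x, the second from Spark 3.x):

print(str(column))
# Spark 2.x: Column<foo AS `bar`>
# Spark 3.x: Column<'foo AS bar'>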

2 answers:

Answer 0 (score: 1)

One way is to use a regular expression:

from pyspark.sql.functions import col

column = col('foo').alias('bar')
print(column)
#Column<foo AS `bar`>

import re
# capture the word between "AS `" and the closing "`>"
print(re.findall(r"(?<=AS `)\w+(?=`>$)", str(column))[0])
#'bar'
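Wrapped up as the FN the question asks for. Note that this parses the Column repr, which is not a stable API; the hypothetical helper below tries the Spark 2.x and 3.x formats and may need adjusting for other releases:

import re
from pyspark.sql.functions import col

def get_alias(column):
    # Pull the alias out of the Column's string form.
    # Spark 2.x prints Column<foo AS `bar`>, Spark 3.x Column<'foo AS bar'>;
    # returns None when the column carries no alias.
    s = str(column)
    match = re.search(r"AS `(\w+)`", s) or re.search(r"AS (\w+)'>$", s)
    return match.group(1) if match else None

assert get_alias(col('foo').alias('bar')) == 'bar'
assert get_alias(col('foo')) is None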

Answer 1 (score: 1)

Alternatively, we can use a wrapper function to adjust the behavior of the Column.alias and Column.name methods so that the alias is also stored in an AS attribute:

from pyspark.sql import Column, SparkSession
from pyspark.sql.functions import col, explode, array, struct, lit

SparkSession.builder.getOrCreate()  # an active session is needed to build Column objects

def alias_wrapper(self, *alias, **kwargs):
    # delegate to the original alias(), then remember the name(s) on the Column
    renamed_col = Column._alias(self, *alias, **kwargs)
    renamed_col.AS = alias[0] if len(alias) == 1 else alias
    return renamed_col

# keep the original under Column._alias, patch alias/name, default AS to None
Column._alias, Column.alias, Column.name, Column.AS = Column.alias, alias_wrapper, alias_wrapper, None

This then guarantees:

assert(col("foo").alias("bar").AS == "bar")
# `name` should act like `alias`
assert(col("foo").name("bar").AS == "bar")
# column without alias should have None in `AS`
assert(col("foo").AS is None)
# multialias should be handled
assert(explode(array(struct(lit(1), lit("a")))).alias("foo", "bar").AS == ("foo", "bar"))
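Because alias_wrapper delegates to the original method, patched columns still behave normally in queries. A quick sanity check (the one-column DataFrame below is made up for illustration):

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["foo"])

renamed = col("foo").alias("bar")
assert renamed.AS == "bar"                     # alias captured by the wrapper
assert df.select(renamed).columns == ["bar"]   # and still applied in the query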