How to extract column names and column types from SQL in PySpark

Date: 2019-05-30 18:46:21

Tags: python sql apache-spark pyspark pyspark-sql

The Spark SQL syntax for CREATE TABLE is like this:

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

where [x] means x is optional. Given such a CREATE SQL query, I want the output as a tuple in the following order:

(db_name, table_name, [(col1 name, col1 type), (col2 name, col2 type), ...])

So is there a way to do this using PySpark SQL functions, or is the help of a regular expression needed?

If a regex is the way to go, what would that regex look like?

1 answer:

Answer 0: (score: 2)

This can be done by accessing the unofficial API via the java_gateway:

import json
from pyspark.sql.types import StructType

plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan(
    "CREATE TABLE foobar.test (foo INT, bar STRING) USING json")
print(f"database: {plan.tableDesc().identifier().database().get()}")
print(f"table: {plan.tableDesc().identifier().table()}")
# perhaps there is a better way to convert the schema; using a JSON round-trip here
print(f"schema: {StructType.fromJson(json.loads(plan.tableDesc().schema().json()))}")

Output:

database: foobar
table: test
schema: StructType(List(StructField(foo,IntegerType,true),StructField(bar,StringType,true)))

Note that database().get() will fail if no database is defined, so the Scala Option should be handled properly. Also, if CREATE TEMPORARY VIEW is used, the accessor names are different. The relevant commands can be found here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L38
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L58
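Since the question also asks about a regex-based approach, here is a minimal, best-effort sketch in plain Python that produces the requested (db_name, table_name, [(col_name, col_type), ...]) tuple. It assumes a simple explicit column list and does not handle nested types containing commas (e.g. MAP<INT, STRING>), backquoted identifiers, or COMMENT clauses; for anything beyond simple DDL, the parser-based approach above is more robust.

```python
import re

def parse_create_table(sql):
    """Best-effort extraction of (db_name, table_name, [(col, type), ...])
    from a CREATE TABLE statement with an explicit column list.
    Returns None if the statement does not match the expected shape."""
    m = re.search(
        r"CREATE\s+(?:TEMPORARY\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"
        r"(?:(\w+)\.)?(\w+)\s*\(([^)]*)\)",
        sql, re.IGNORECASE)
    if not m:
        return None
    db_name, table_name, cols_raw = m.groups()
    cols = []
    for col in cols_raw.split(","):
        # First token is the column name, second is its type;
        # anything after that (e.g. a COMMENT) is ignored.
        parts = col.split()
        if len(parts) >= 2:
            cols.append((parts[0], parts[1]))
    return (db_name, table_name, cols)

print(parse_create_table(
    "CREATE TABLE foobar.test (foo INT, bar STRING) USING json"))
# → ('foobar', 'test', [('foo', 'INT'), ('bar', 'STRING')])
```

When no database prefix is present, db_name comes back as None, mirroring the Option-handling caveat above.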