I am trying to get the schema from a PySpark DataFrame and use the column definitions to create a Hive table. Since I need to partition the table, I have to create the Hive table first and then load the data.
rawSchema = df.schema
schemaString = rawSchema.fields.map(lambda field : field.name+" "+field.dataType.typeName).collect(Collectors.joining(","));
sqlQuery = "CREATE TABLE IF NOT EXISTS Hive_table COLUMNS (" + schemaString + ") PARTITIONED BY (p int) STORED AS PARQUET LOCATION Hive_table_path;"
But the second line fails with:
AttributeError: 'list' object has no attribute 'map'
This Scala code works fine, but I need to convert it to PySpark:
StructType my_schema = my_DF.schema();
String columns = Arrays.stream(my_schema.fields()).map(field ->field.name()+" "+field.dataType().typeName()).collect(Collectors.joining(","));
Please help!
Answer 0 (score: 1):
Python lists don't have a .map method; instead, build a list of column definitions with a list comprehension and then join them to create schemaString.
Example:
df.show()
#+---+----+
#| id|name|
#+---+----+
#| a| 2|
#| b| 3|
#+---+----+
df.schema
#StructType(List(StructField(id,StringType,true),StructField(name,StringType,true)))
schemaString=','.join([f.name+" "+f.dataType.typeName() for f in df.schema.fields])
#'id string,name string'
#Hive DDL takes the column list directly in parentheses (no COLUMNS keyword) and LOCATION must be a quoted path
sqlQuery = "CREATE TABLE IF NOT EXISTS Hive_table (" + schemaString + ") PARTITIONED BY (p int) STORED AS PARQUET LOCATION 'Hive_table_path'"
#"CREATE TABLE IF NOT EXISTS Hive_table (id string,name string) PARTITIONED BY (p int) STORED AS PARQUET LOCATION 'Hive_table_path'"
#Using df.dtypes
schemaString=",".join([' '.join(w) for w in df.dtypes])
#'id string,name string'
sqlQuery = "CREATE TABLE IF NOT EXISTS Hive_table (" + schemaString + ") PARTITIONED BY (p int) STORED AS PARQUET LOCATION 'Hive_table_path'"
#"CREATE TABLE IF NOT EXISTS Hive_table (id string,name string) PARTITIONED BY (p int) STORED AS PARQUET LOCATION 'Hive_table_path'"
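Since df.dtypes is just a list of (name, type) tuples, the join technique can be checked without a Spark session at all. Below is a minimal sketch using a hard-coded stand-in for df.dtypes; the table path is a hypothetical placeholder, and it uses the Hive DDL form with the column list directly in parentheses and a quoted LOCATION:

```python
# Hypothetical stand-in for df.dtypes, which returns a list of (name, type) tuples.
dtypes = [("id", "string"), ("name", "string")]

# Same technique as the answer: join the "name type" pairs with commas.
schemaString = ",".join(" ".join(pair) for pair in dtypes)

# Build the DDL string; '/hive_table_path' is a placeholder path.
sqlQuery = (
    "CREATE TABLE IF NOT EXISTS Hive_table (" + schemaString + ") "
    "PARTITIONED BY (p int) STORED AS PARQUET "
    "LOCATION '/hive_table_path'"
)
print(sqlQuery)
```

In a real job you would then execute the statement with spark.sql(sqlQuery) before loading data into the partitions.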