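I have an Avro schema file and need to create a table in Databricks through pyspark. I don't need to load any data, I only want to create the table. The simplest approach is to load the JSON string, extract the "name" and "type" of each entry in the "fields" array, and generate a CREATE SQL query from them. I'd like to know whether there is a programmatic way to do this with an existing API. Sample schema: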
{
"type" : "record",
"name" : "kylosample",
"doc" : "Schema generated by Kite",
"fields" : [ {
"name" : "registration_dttm",
"type" : "string",
"doc" : "Type inferred from '2016-02-03T07:55:29Z'"
}, {
"name" : "id",
"type" : "long",
"doc" : "Type inferred from '1'"
}, {
"name" : "first_name",
"type" : "string",
"doc" : "Type inferred from 'Amanda'"
}, {
"name" : "last_name",
"type" : "string",
"doc" : "Type inferred from 'Jordan'"
}, {
"name" : "email",
"type" : "string",
"doc" : "Type inferred from 'ajordan0@com.com'"
}, {
"name" : "gender",
"type" : "string",
"doc" : "Type inferred from 'Female'"
}, {
"name" : "ip_address",
"type" : "string",
"doc" : "Type inferred from '1.197.201.2'"
}, {
"name" : "cc",
"type" : [ "null", "long" ],
"doc" : "Type inferred from '6759521864920116'",
"default" : null
}, {
"name" : "country",
"type" : "string",
"doc" : "Type inferred from 'Indonesia'"
}, {
"name" : "birthdate",
"type" : "string",
"doc" : "Type inferred from '3/8/1971'"
}, {
"name" : "salary",
"type" : [ "null", "double" ],
"doc" : "Type inferred from '49756.53'",
"default" : null
}, {
"name" : "title",
"type" : "string",
"doc" : "Type inferred from 'Internal Auditor'"
}, {
"name" : "comments",
"type" : "string",
"doc" : "Type inferred from '1E+02'"
} ]
}
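Answer 0 (score: 0)
This doesn't appear to be available through the Python API yet. Here is how I have done it in the past: use Spark SQL to create an external table that points to your exported .avsc file, since you only want to create the table and not load any data. For example: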
spark.sql("""
create external table db.table_name
STORED AS AVRO
LOCATION 'PATH/WHERE/DATA/WILL/BE/STORED'
TBLPROPERTIES('avro.schema.url'='PATH/TO/SCHEMA.avsc')
""")
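The native Scala API in Spark 2.4 now appears to be able to read a .avsc schema. Since you are using Databricks, you can switch the language of a notebook cell with magic commands such as %scala, %python, or %sql. Scala example: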
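import java.io.File
import org.apache.avro.Schema

// Parse the .avsc file into an Avro Schema object
val schema = new Schema.Parser().parse(new File("user.avsc"))
// Read the Avro data using that schema instead of the one embedded in the files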
spark
.read
.format("avro")
.option("avroSchema", schema.toString)
.load("/tmp/episodes.avro")
.show()
Reference documentation for the Spark 2.4 Avro integration:
https://spark.apache.org/docs/latest/sql-data-sources-avro.html#configuration
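If you end up falling back to the manual approach described in the question, it can be sketched in plain Python roughly as follows. This is only a minimal sketch under a few assumptions: the helper names (`avro_field_to_column`, `build_create_table`) and the type mapping table are made up for illustration, only primitive Avro types and `["null", X]` unions are handled, and nullability is ignored in the generated DDL.

```python
import json

# Assumed mapping from Avro primitive types to Spark SQL column types.
AVRO_TO_SPARK_SQL = {
    "string": "STRING",
    "long": "BIGINT",
    "int": "INT",
    "double": "DOUBLE",
    "float": "FLOAT",
    "boolean": "BOOLEAN",
    "bytes": "BINARY",
}


def avro_field_to_column(field):
    """Turn one entry of the Avro "fields" array into a column definition."""
    avro_type = field["type"]
    # A union such as ["null", "long"] marks an optional field; keep the non-null branch.
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"Unsupported union for field {field['name']}: {avro_type}")
        avro_type = non_null[0]
    # Nested records, arrays, maps and logical types are out of scope for this sketch.
    if not isinstance(avro_type, str) or avro_type not in AVRO_TO_SPARK_SQL:
        raise ValueError(f"Unsupported type for field {field['name']}: {avro_type}")
    return f"`{field['name']}` {AVRO_TO_SPARK_SQL[avro_type]}"


def build_create_table(schema_json, table_name):
    """Build a CREATE TABLE statement from an Avro record schema given as a JSON string."""
    schema = json.loads(schema_json)
    columns = [avro_field_to_column(f) for f in schema["fields"]]
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n  "
        + ",\n  ".join(columns)
        + "\n)"
    )
```

In a Databricks notebook you would then read the .avsc file into a string and pass the result to `spark.sql(build_create_table(schema_json, "db.table_name"))`. Anything beyond flat records (nested records, arrays, logical types) would need extra handling.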