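Does anyone know where the BigQuery schema format is documented? In other words, the JSON schema you supply when uploading a file: personsDataSchema.json in this example.
I have been googling for ages, but I cannot find any documentation on the schema of the schema.
The closest I can get is the documentation about auto-detecting schemas. But if auto-detection does not fit your case and you need to supply a predefined JSON schema, is there any documentation on which fields are required and which values are allowed?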
Answer 0 (score: 4)
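To define a schema, you essentially need three keys per field: name, type and mode.
Every field in the table should define these three keys (strictly speaking, BigQuery treats mode as optional and defaults it to NULLABLE, but spelling it out keeps the schema explicit). Suppose you have a table like this: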
user_id  source
1        search
2        email
Then the schema can be defined as:
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"}]
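As a minimal sketch of how such a schema ends up in a file, the snippet below builds the same two-field schema as a Python list and writes it out as a JSON array. The helper name write_schema and the file name schema.json are assumptions made for illustration; they are not part of any BigQuery tooling.

```python
import json
import tempfile
from pathlib import Path

# Schema from the example above: one dict per column.
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
]


def write_schema(fields, path):
    """Serialize the field list as the JSON array that a schema file holds."""
    path = Path(path)
    path.write_text(json.dumps(fields, indent=2), encoding="utf-8")
    return path


# "schema.json" is a hypothetical file name; a temp dir keeps the sketch tidy.
schema_path = write_schema(schema, Path(tempfile.mkdtemp()) / "schema.json")

# Read the file back to confirm it round-trips unchanged.
loaded = json.loads(schema_path.read_text(encoding="utf-8"))
```

A file like this is the kind of thing you would hand to the load step as the schema argument, in the same role personsDataSchema.json plays in the question.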
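The key name simply holds the field name, e.g. "user_id".
The key type is the data type, such as STRING, INTEGER, FLOAT and so on; the BigQuery data types documentation lists the types currently supported.
If you open that documentation, you will also see an ARRAY data type, which corresponds to a REPEATED field. I will come back to those below.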
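The third key, mode, can be one of: NULLABLE (the value may be NULL), REQUIRED (the value may not be NULL), or REPEATED (the field is an ARRAY).
So, let's take the previous example and add a repeated field (i.e. an ARRAY field) to illustrate: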
user_id  source  wishlist
1        search  ["sku 0", "sku 1"]
2        email   []
3        direct  ["sku 0", "sku 3"]
The schema can be defined as follows:
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"},
{"name": "wishlist", "type": "STRING", "mode": "REPEATED"}]
There you have it: the ARRAY field is defined as a repetition of STRING values.
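To make the three modes concrete, here is a rough sketch that checks rows against the schema above. The function check_row is hypothetical and deliberately simplified: it only looks at mode, ignores type, and treats a missing REPEATED field as an empty array. It is not how BigQuery itself validates data.

```python
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
    {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
]


def check_row(row, schema):
    """Return a list of mode violations for one row (a dict) against a flat schema."""
    problems = []
    for field in schema:
        name, mode = field["name"], field["mode"]
        value = row.get(name)  # a missing key is treated like NULL
        if mode == "REQUIRED" and value is None:
            problems.append(f"{name}: REQUIRED but missing or NULL")
        elif mode == "REPEATED" and value is not None and not isinstance(value, list):
            problems.append(f"{name}: REPEATED expects a list")
        elif mode == "NULLABLE" and isinstance(value, list):
            problems.append(f"{name}: NULLABLE expects a single value, got a list")
    return problems


rows = [
    {"user_id": 1, "source": "search", "wishlist": ["sku 0", "sku 1"]},
    {"user_id": 2, "source": "email", "wishlist": []},
    # Deliberately broken row: NULL in a REQUIRED field, scalar in a REPEATED one.
    {"user_id": None, "source": "direct", "wishlist": "sku 0"},
]

for row in rows:
    print(check_row(row, schema))
```

The first two rows pass; the third is reported for both user_id and wishlist.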
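There is still one more kind of field, the RECORD field (STRUCT). These work basically the same way, except that we define a fourth key for them, fields.
Since a RECORD contains other fields, you also have to describe their definitions; this is easier to understand with an example: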
user_id  source  wishlist            location.country  location.city
1        search  ["sku 0", "sku 1"]  USA               NY
2        email   []                  USA               LA
3        direct  ["sku 0", "sku 3"]  BR                SP
Here, location is a RECORD (STRUCT) that contains two keys: country and city. This is how you define the schema for them:
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"},
{"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
{"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]}]
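Do you want a REPEATED field of RECORDs? Sure, why not! For instance, if you want a REPEATED field holding each hit a customer makes on your website, you could define the schema like this: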
[{"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
{"name": "source", "type": "STRING", "mode": "NULLABLE"},
{"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
{"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [{"name": "country", "type": "STRING", "mode": "NULLABLE"}, {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
{"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [{"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"}, {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]}]
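Because RECORD fields nest, it can help to see the schema as a flat list of dotted paths, like the location.country column header used earlier. The sketch below walks the schema recursively; the function name flatten is an assumption for illustration only.

```python
schema = [
    {"name": "user_id", "type": "INT64", "mode": "REQUIRED"},
    {"name": "source", "type": "STRING", "mode": "NULLABLE"},
    {"name": "wishlist", "type": "STRING", "mode": "REPEATED"},
    {"name": "location", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "country", "type": "STRING", "mode": "NULLABLE"},
        {"name": "city", "type": "STRING", "mode": "NULLABLE"}]},
    {"name": "hit", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "hitNumber", "type": "INT64", "mode": "NULLABLE"},
        {"name": "hitPage", "type": "STRING", "mode": "NULLABLE"}]},
]


def flatten(fields, prefix=""):
    """Yield (dotted_path, type, mode) for every field, descending into RECORDs."""
    for field in fields:
        path = prefix + field["name"]
        yield path, field["type"], field["mode"]
        if field["type"] in ("RECORD", "STRUCT"):
            # Inner fields are described by the same three keys, one level down.
            yield from flatten(field["fields"], prefix=path + ".")


paths = [path for path, _type, _mode in flatten(schema)]
for path, field_type, mode in flatten(schema):
    print(f"{path:20} {field_type:8} {mode}")
```

The RECORD itself appears in the output (location, hit) followed by its inner fields (location.country, hit.hitNumber and so on).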
Given all this, we can finally answer your question: how do you define the schema for dataPersons.json?
Here is an example of one row of personData:
{"kind": "person",
"fullName": "John Doe",
"age": 22,
"gender": "Male",
"phoneNumber": {"areaCode": "206", "number": "1234567"},
"children": [{"name": "Jane", "gender": "Female", "age": "6"},
{"name": "John", "gender": "Male", "age": "15"}],
"citiesLived": [{"place": "Seattle", "yearsLived": ["1995"]},
{"place": "Stockholm", "yearsLived": ["2005"]}]}
First we have "kind": "person". This one is easy; its schema would be (use "REQUIRED" instead of "NULLABLE" if every row must carry a value):
{"name": "kind", "type": "STRING", "mode": "NULLABLE"}
phoneNumber is a RECORD (STRUCT) field with two inner fields, areaCode and number. Well, we have already seen how to define those! The mode can be either NULLABLE or REQUIRED; NULLABLE is used here:
{"name": "phoneNumber",
"type": "RECORD",
"mode": "NULLABLE",
"fields": [{"name": "areaCode", "type": "INT64", "mode": "NULLABLE"},
{"name": "number", "type": "INT64", "mode": "NULLABLE"}]}
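Now, children and citiesLived have the same kind of definition: both are REPEATED (ARRAY) fields of RECORDs (STRUCTs). Just as in our last example, this one should be straightforward too; citiesLived would be defined as: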
{"name": "citiesLived",
"type": "RECORD",
"mode": "REPEATED",
"fields": [{"name": "place", "type": "STRING", "mode": "NULLABLE"},
{"name": "yearsLived", "type": "INT64", "mode": "REPEATED"}]}
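There you have it. That is basically all there is to schema definition. If you use Python, for instance, the idea is the same. You import the class SchemaField to define each field, like so: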
from google.cloud.bigquery import SchemaField
# The data type is passed as field_type (the constructor has no "type" keyword).
field_kind = SchemaField(name="kind", field_type="STRING", mode="NULLABLE")
Other client libraries follow the same idea.
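To sum up, you define three keys for every field in the table: name, type and mode. If the field is of type RECORD, you must also define fields, and for each inner field you again define the three keys (or four, if the inner field is itself a RECORD).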
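The rule in this summary is recursive, so it can be sketched as a small recursive checker. The function schema_errors and the constant ALLOWED_MODES are hypothetical names; the check follows the rule as stated in this answer and is not BigQuery's own validation.

```python
ALLOWED_MODES = {"NULLABLE", "REQUIRED", "REPEATED"}


def schema_errors(fields, prefix=""):
    """Return a list of problems: missing keys, bad modes, RECORDs without fields."""
    errors = []
    for field in fields:
        path = prefix + str(field.get("name", "<unnamed>"))
        # Every field needs the three basic keys.
        for key in ("name", "type", "mode"):
            if key not in field:
                errors.append(f"{path}: missing key '{key}'")
        if "mode" in field and field["mode"] not in ALLOWED_MODES:
            errors.append(f"{path}: invalid mode {field['mode']!r}")
        # A RECORD needs the fourth key, and its inner fields follow the same rule.
        if field.get("type") in ("RECORD", "STRUCT"):
            if "fields" not in field:
                errors.append(f"{path}: RECORD needs 'fields'")
            else:
                errors.extend(schema_errors(field["fields"], prefix=path + "."))
    return errors


good = [{"name": "citiesLived", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "place", "type": "STRING", "mode": "NULLABLE"},
    {"name": "yearsLived", "type": "INT64", "mode": "REPEATED"}]}]

# Deliberately broken: mode is not a single valid value and 'fields' is missing.
bad = [{"name": "phoneNumber", "type": "RECORD", "mode": "NULLABLE OR REQUIRED"}]

print(schema_errors(good))
print(schema_errors(bad))
```

The first call reports nothing; the second flags both the invalid mode and the missing fields key on phoneNumber.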
I hope this makes it clearer how to define a schema. If you still have questions on this topic, let me know and I will update the answer.