I am trying to write a Dataflow pipeline in Python to migrate data from Google Datastore to BigQuery. After some searching, I found that I need to perform three steps:
1. ReadFromDatastore
2. Convert to Python dicts or TableRows
3. WriteToBigQuery
Now, the first and last steps are straightforward, since they are functions themselves. But I am having a hard time finding a good approach for the second step.
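Roughly, the skeleton I have so far looks like this. The project, kind, dataset, table, and schema are placeholders, and the imports assume the older v1 datastoreio connector (the paths differ between Beam versions); the commented-out middle step is the part I can't figure out:

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2

# Query for all entities of the kind I am migrating.
query = query_pb2.Query()
query.kind.add().name = 'KindName'

with beam.Pipeline() as p:
    (p
     | 'Read' >> ReadFromDatastore(project='ProjectID', query=query)
     # Step 2 should go here: turn each Entity protobuf into a Python dict.
     | 'Write' >> beam.io.WriteToBigQuery(
           'ProjectID:my_dataset.my_table',
           schema='property1:STRING,version:INTEGER'))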
I wrote the output of ReadFromDatastore to a text file, and the json looks like this:
key {
partition_id {
project_id: "ProjectID"
}
path {
kind: "KindName"
id:9999
}
}
properties {
key: "property1"
value {
string_value: "property_value"
}
}
properties {
key: "property2"
value {
string_value: ""
}
}
properties {
key: "property3"
value {
boolean_value: false
}
}
properties {
key: "created"
value {
timestamp_value {
seconds: 4444
nanos: 2222
}
}
}
properties {
key: "created_by"
value {
string_value: "property_value"
}
}
properties {
key: "date_created"
value {
timestamp_value {
seconds: 4444
}
}
}
properties {
key: "property4"
value {
string_value: "property_value"
}
}
properties {
key: "property5"
value {
array_value {
values {
meaning: 00
string_value: "link"
exclude_from_indexes: true
}
}
}
}
properties {
key: "property6"
value {
null_value: NULL_VALUE
}
}
properties {
key: "property7"
value {
string_value: "property_value"
}
}
properties {
key: "property8"
value {
string_value: ""
}
}
properties {
key: "property9"
value {
timestamp_value {
seconds: 3333
nanos: 3333
}
}
}
properties {
key: "property10"
value {
meaning: 00
string_value: ""
exclude_from_indexes: true
}
}
properties {
key: "property11"
value {
boolean_value: false
}
}
properties {
key: "property12"
value {
array_value {
values {
key_value {
partition_id {
project_id: "project_id"
}
path {
kind: "Another_kind_name"
id: 4444
}
}
}
}
}
}
properties {
key: "property13"
value {
string_value: "property_value"
}
}
properties {
key: "version"
value {
integer_value: 4444
}
}
key {
partition_id {
project_id: "ProjectID"
}
path {
kind: "KindName"
id: 9999
}
}
...
(next entity/row)
Do I have to write a custom function to convert this json into a Python dict in order to write to BigQuery, or is there a function/library from Google Datastore or Apache Beam that I can use?
I found an article that describes what I am trying to do, but the code shown is in Java.
Answer (score: 0):
The output of the ReadFromDatastore transform is a protocol buffer of type Entity.
To convert the protobuf to JSON, you can check this question: Protobuf to json in python.
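For example, a minimal sketch using google.protobuf.json_format (which is what that question points to) could look like this; it assumes the elements coming out of ReadFromDatastore are standard protobuf Entity messages:

from google.protobuf import json_format

def entity_to_dict(entity):
    # MessageToDict walks the protobuf fields generically, so the nested
    # key/properties structure from the dump above is preserved as-is.
    # You will probably still want to flatten `properties` into one
    # BigQuery column per property before writing.
    return json_format.MessageToDict(entity)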
You would do something like this:
p | ReadFromDatastore(...) | beam.Map(my_proto_to_json_fn) | beam.io.WriteToBigQuery(...)
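Putting the pieces together, a sketch of the whole pipeline might look like the following. The dataset, table name, schema, bucket, and the v1 import paths are assumptions; adjust them to your project and Beam version, and flatten the dict produced by MessageToDict so its keys match your BigQuery schema:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2
from google.protobuf import json_format


def entity_to_row(entity):
    # Generic protobuf -> dict conversion; flatten/rename the nested
    # `properties` map here so the keys match the BigQuery schema below.
    return json_format.MessageToDict(entity)


options = PipelineOptions(
    project='ProjectID',
    runner='DataflowRunner',
    temp_location='gs://my-bucket/tmp')

# Read every entity of the kind being migrated.
query = query_pb2.Query()
query.kind.add().name = 'KindName'

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromDatastore' >> ReadFromDatastore(project='ProjectID', query=query)
     | 'EntityToDict' >> beam.Map(entity_to_row)
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
           'ProjectID:my_dataset.my_table',
           schema='property1:STRING,version:INTEGER',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))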