Google Cloud - Pub/Sub into Dataflow

Date: 2018-02-07 08:52:10

Tags: json google-bigquery google-cloud-platform google-cloud-dataflow google-cloud-pubsub

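I am calling Pub/Sub through REST requests. I am trying to put columnar data onto a topic in Pub/Sub, have it flow into Dataflow, and finally land in BigQuery, where a table is defined.

This is the layout of the JSON data: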

[
  {
    "age": "58",
    "job": "management",
    "marital": "married",
    "education": "tertiary",
    "default": "no",
    "balance": "2143",
    "housing": "yes",
    "loan": "no",
    "contact": "unknown",
    "day": "5",
    "month": "may",
    "duration": "261",
    "campaign": "1",
    "pdays": "-1",
    "previous": "0",
    "poutcome": "unknown",
    "y": "no"
    }
]

To form a valid JSON body, the data has to go into the following request, which Pub/Sub recognizes:

{
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en"
        },
        "data": "%DATA%"
    }]
}

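The Pub/Sub REST reference states that the "data" field must be Base64-encoded, so that is what I did. The final JSON looks like this (%DATA% is replaced with the Base64 encoding of the original message data):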

{
    "messages": [{
        "attributes": {
            "key": "iana.org/language_tag",
            "value": "en"
        },
        "data": "Ww0KICB7DQogICAgImFnZSI6ICI1OCIsDQogICAgImpvYiI6ICJtYW5hZ2VtZW50IiwNCiAgICAibWFyaXRhbCI6ICJtYXJyaWVkIiwNCiAgICAiZWR1Y2F0aW9uIjogInRlcnRpYXJ5IiwNCiAgICAiZGVmYXVsdCI6ICJubyIsDQogICAgImJhbGFuY2UiOiAiMjE0MyIsDQogICAgImhvdXNpbmciOiAieWVzIiwNCiAgICAibG9hbiI6ICJubyIsDQogICAgImNvbnRhY3QiOiAidW5rbm93biIsDQogICAgImRheSI6ICI1IiwNCiAgICAibW9udGgiOiAibWF5IiwNCiAgICAiZHVyYXRpb24iOiAiMjYxIiwNCiAgICAiY2FtcGFpZ24iOiAiMSIsDQogICAgInBkYXlzIjogIi0xIiwNCiAgICAicHJldmlvdXMiOiAiMCIsDQogICAgInBvdXRjb21lIjogInVua25vd24iLA0KICAgICJ5IjogIm5vIg0KICAgIH0NCl0="
    }]
}
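For illustration, the Base64 step described above could be scripted roughly as follows. This is only a sketch: `build_publish_body` is a hypothetical helper name, the record is a shortened stand-in for the real row, and the attributes are copied as-is from the request above.

```python
import base64
import json


def build_publish_body(payload, attributes=None):
    """Wrap a JSON-serializable payload in a Pub/Sub publish request body."""
    # Serialize the payload, then Base64-encode its UTF-8 bytes for the "data" field.
    raw = json.dumps(payload).encode("utf-8")
    message = {"data": base64.b64encode(raw).decode("ascii")}
    if attributes:
        message["attributes"] = attributes
    return {"messages": [message]}


# Shortened stand-in for the real row (hypothetical sample).
record = {"age": "58", "job": "management", "y": "no"}
# Attributes copied verbatim from the request shown above.
attributes = {"key": "iana.org/language_tag", "value": "en"}

body = build_publish_body(record, attributes)
print(json.dumps(body, indent=4))
```

The printed structure is what would be sent as the body of the publish request.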

Pub/Sub accepts this data and passes it on to Dataflow, but that is where everything breaks. Dataflow tries to deserialize the message and fails with the following error:

(efdf538fc01f50b0): java.lang.RuntimeException: Unable to parse input
        com.google.cloud.teleport.templates.common.BigQueryConverters$JsonToTableRow$1.apply(BigQueryConverters.java:58)
        com.google.cloud.teleport.templates.common.BigQueryConverters$JsonToTableRow$1.apply(BigQueryConverters.java:47)
        org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:122)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Can not deserialize instance of com.google.api.services.bigquery.model.TableRow out of START_ARRAY token
 at [Source: [{"age":"32","job":"\"admin.\"","marital":"\"single\"","education":"\"secondary\"","default":"\"no\"","balance":"5","housing":"\"yes\"","loan":"\"no\"","contact":"\"unknown\"","day":"12","month":"\"may\"","duration":"593","campaign":"2","pdays":"-1","previous":"0","poutcome":"\"unknown\"","y":"\"no\""}]; line: 1, column: 1]

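I think this has something to do with how the "data" field is formatted, but I have tried other approaches and could not get any of them to work.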
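One way to check that suspicion is to decode a "data" value and look at the top-level JSON type, since the stack trace complains about a START_ARRAY token. The sketch below uses small hypothetical payloads rather than the real message, and `top_level_json_type` is an illustrative helper, not part of any Google library.

```python
import base64
import json


def top_level_json_type(data_b64):
    """Decode a Base64 "data" field and return the Python type name of the parsed JSON."""
    decoded = base64.b64decode(data_b64).decode("utf-8")
    return type(json.loads(decoded)).__name__


# Hypothetical payloads: the same record wrapped in an array, and on its own.
array_payload = base64.b64encode(b'[{"age": "58"}]').decode("ascii")
object_payload = base64.b64encode(b'{"age": "58"}').decode("ascii")

print(top_level_json_type(array_payload))   # list
print(top_level_json_type(object_payload))  # dict
```

A result of `list` means the message body starts with `[`, which matches the token named in the error.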

1 Answer:

Answer 0: (score: 5)

After further experimentation, the problem was indeed how the JSON was formatted. After removing the opening [ and the closing ], Dataflow was able to recognize the data and then put it into BigQuery.
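If the source data arrives as an array of rows, a possible way to apply this fix is to publish each row as its own message, so that every "data" field holds a single JSON object. This is a minimal sketch under that assumption; `records_to_messages` and the sample rows are hypothetical.

```python
import base64
import json


def records_to_messages(records):
    """Build a publish body with one message per record, each a bare JSON object."""
    messages = []
    for record in records:
        # Each message carries one object, not the enclosing array.
        raw = json.dumps(record).encode("utf-8")
        messages.append({"data": base64.b64encode(raw).decode("ascii")})
    return {"messages": messages}


# Hypothetical sample rows, shortened from the layout in the question.
rows = [
    {"age": "58", "job": "management"},
    {"age": "32", "job": "admin."},
]

publish_body = records_to_messages(rows)
for message in publish_body["messages"]:
    print(base64.b64decode(message["data"]).decode("utf-8"))
```

Each printed line starts with `{` rather than `[`, which is the shape the answer reports Dataflow accepting.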