How can I load a BigQuery table into a Dataproc cluster?

Time: 2020-05-31 09:00:09

Tags: pyspark jupyter-lab google-cloud-dataproc

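I am new to Dataproc clusters and PySpark. While looking for code to load a table from BigQuery into the cluster, I came across the code below, but I cannot figure out what I need to change in it for my use case, or what we are supposed to provide as the input directory.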

let timings = [{
    isOpen: 1,
    weekday: 1,
    humanDay: "Monday",
    periods: [{
        openDay: "Monday",
        openTime: "12:00",
        closeDay: "Monday",
        closeTime: "14:30",
      },
      {
        openDay: "Monday",
        openTime: "19:00",
        closeDay: "Monday",
        closeTime: "22:30",
      },
      {
        openDay: "Monday",
        openTime: "23:00",
        closeDay: "Monday",
        closeTime: "23:30",
      },
    ],
  },
  {
    isOpen: 1,
    weekday: 2,
    humanDay: "Tuesday",
    periods: [{
        openDay: "Tuesday",
        openTime: "12:00",
        closeDay: "Tuesday",
        closeTime: "14:30",
      },
      {
        openDay: "Tuesday",
        openTime: "19:00",
        closeDay: "Tuesday",
        closeTime: "22:30",
      },
      {
        openDay: "Tuesday",
        openTime: "23:00",
        closeDay: "Tuesday",
        closeTime: "23:30",
      },
    ],
  },
];

// create an empty object
const weekdays = {};

timings.forEach((timing) => {
  timing.periods.forEach((period) => {
    // check if the object has a key matching
    // the openTime to CloseTime string
    // (this can be any key, but we want to capture all
    // .. days that have the same open and close times)
    
    if (!weekdays[`${period.openTime}-${period.closeTime}`]) {
      // the key does not exist, so let's create a new sub-object for
      // that given key, and prepare its array of days:
      weekdays[`${period.openTime}-${period.closeTime}`] = {
        days: [],
      };
    }
    
    // now, add the current day to the pre-defined sub-array:
    weekdays[`${period.openTime}-${period.closeTime}`].days.push(
      timing.humanDay
    );
    
    // also, store the openTime and closeTime as sub-properties, for convenience.
    // I know they are already stored in the key, but the whole purpose of the key
    // is to reduce duplicates by taking advantage of JavaScript's built-in
    // functionality.

    weekdays[`${period.openTime}-${period.closeTime}`]["openTime"] =
      period.openTime;

    weekdays[`${period.openTime}-${period.closeTime}`]["closeTime"] =
      period.closeTime;
  });
});




console.log(weekdays);

1 answer:

Answer 0 (score: 0)

You are trying to use the Hadoop BigQuery connector. For Spark, you should use the Spark BigQuery connector instead.

To read data from BigQuery, you can follow this example:

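from pyspark.sql import SparkSession

# Reuse the notebook's existing session if there is one; otherwise create it.
spark = SparkSession.builder.appName('spark-bigquery-demo').getOrCreate()

# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the connector.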
bucket = "[bucket]"
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:samples.shakespeare') \
  .load()
words.createOrReplaceTempView('words')

# Perform word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
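
To adapt the example to your own table, the main thing to change is the value passed to the 'table' option. The sketch below wraps that in a small helper. The names `table_ref` and `load_bigquery_table`, and the project, dataset, and table names in the usage comment, are hypothetical placeholders rather than part of the connector. It assumes the Spark BigQuery connector is already available on the cluster and that `spark` is an existing SparkSession.

```python
def table_ref(project, dataset, table):
    """Build a fully qualified BigQuery table name (hypothetical helper)."""
    return f"{project}.{dataset}.{table}"


def load_bigquery_table(spark, project, dataset, table):
    """Read one BigQuery table into a Spark DataFrame.

    `spark` is assumed to be an existing SparkSession with the
    Spark BigQuery connector on its classpath.
    """
    return (
        spark.read.format("bigquery")
        .option("table", table_ref(project, dataset, table))
        .load()
    )


# Usage (placeholder names; replace them with your own):
# df = load_bigquery_table(spark, "my-project", "my_dataset", "my_table")
# df.printSchema()
```

With this connector you point at a table directly, so there is no input directory to supply.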