How to safely perform getAs on a Spark DataFrame Row?

Date: 2018-11-29 17:44:18

Tags: scala apache-spark

I have a DataFrame that looks like this:

val df = Seq(
  ("x", "y", 1), ("x", "z", 2), ("x", "a", 4), ("x", "a", 5),
  ("t", "y", 1), ("t", "y2", 6), ("t", "y3", 3), ("t", "y4", 5)
).toDF("F1", "F2", "F3")


+---+---+---+
| F1| F2| F3|
+---+---+---+
|  x|  y|  1|
|  x|  z|  2|
|  x|  a|  4|
|  x|  a|  5|
|  t|  y|  1|
|  t| y2|  6|
|  t| y3|  3|
|  t| y4|  5|
+---+---+---+

I am filtering and selecting a value like this:

df.filter($"F1" === "x" && $"F2"==="y").head.getInt(2)

The above works. But the following throws an exception:

df.filter($"F1" === "x" && $"F2"==="y").head.getDouble(2)

Also, the following breaks when the filtered DataFrame contains no records:

df.filter($"F1" === "x" && $"F2"==="y1").head.getAs[Int]("F3")

So how can I perform getAs[]() safely and get the value back? I always want a Double, whether the underlying value is an Int or a Double, and if the filtered DataFrame is empty it should return 0.0.

1 Answer:

Answer 0 (score: 1):

Don't use the dynamic API at all. Use the strongly typed API and cast the type explicitly.
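
A minimal sketch of that approach (assuming the question's df and spark.implicits._ are in scope for $ and the Double encoder):

// Filter as before, cast F3 to double explicitly, then read it
// back as a typed Dataset[Double] instead of calling getAs on a Row.
df.filter($"F1" === "x" && $"F2" === "y")
  .select($"F3".cast("double").as[Double])
  .take(1).headOption.getOrElse(0.0) // 0.0 when no rows match the filter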


Usage example:

import org.apache.spark.sql.DataFrame

// Assumes spark.implicits._ is in scope for $ and the Double encoder.
def get(df: DataFrame): Double = df
  .select($"F3".cast("double").as[Double]) // explicit cast widens Int to Double
  .take(1).headOption.getOrElse(0.0)       // default to 0.0 on an empty DataFrame
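
With the sample df from the question, hypothetical calls for illustration:

get(df.filter($"F1" === "x" && $"F2" === "y"))  // 1.0 (F3 = 1, widened to Double)
get(df.filter($"F1" === "x" && $"F2" === "y1")) // 0.0 (no matching rows)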