Question

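Please help me get the best performance when reading data from Redshift.

Option 1: I unload the table's data to an S3 folder and then read it as a DataFrame.

Option 2: I read it using sqlContext.

My data volume is currently small, though it is expected to grow over the coming months. When I tried both approaches, they took almost the same time.

Option 1: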

unload ('select * from sales_hist')   
to 's3://mybucket/tickit/unload/sales_' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

hist_output_table_df = spark.read.format(config['reader_format'])\
      .option('header', config['reader_header'])\
      .option('delimiter', config['reader_delimiter'])\
      .csv(s3_directory +  config['reader_path'])

reader_path is the same directory that the unload above writes to.

Option 2:

hist_output_table_df = sqlContext.read\
      .format("com.databricks.spark.redshift")\
      .option("url", jdbcConnection)\
      .option("tempdir", tempS3Dir)\
      .option("dbtable", table_name)\
      .option("aws_iam_role", aws_role)\
      .load()

Is there any cost difference between the two approaches?

Answer 1

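The Spark Redshift driver that sqlContext uses performs an UNLOAD behind the scenes. That is why you have to provide tempS3Dir: it is the location the driver unloads to.

So performance will be roughly the same, but I would recommend sqlContext because it encapsulates the unload and read steps for you.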
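To illustrate the point about UNLOAD, here is a rough sketch of the kind of statement that has to be issued either way, whether you write it yourself (option 1) or the driver generates it against tempdir (option 2). The helper name build_unload_sql and the temp_s3_dir value are made up for this example, and the driver's real statement adds its own format options, which are not reproduced here.

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement (hypothetical helper, illustration only)."""
    # UNLOAD takes the query as a quoted string literal, so any single
    # quotes inside the query have to be doubled.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}';"
    )


# Placeholder temp location; with option 2 this role is played by tempS3Dir.
temp_s3_dir = "s3://mybucket/tmp/redshift/"

statement = build_unload_sql(
    "select * from sales_hist",
    temp_s3_dir,
    "arn:aws:iam::0123456789012:role/MyRedshiftRole",
)
print(statement)
```

In both options the data goes from Redshift to S3 and is then read by Spark, which is consistent with the two approaches taking about the same time.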

Performance of unload vs. sqlContext read
