I have more than 8 schemas and over 200 tables, and the data is loaded into the different schemas from CSV files.
I want to know a SQL script that finds the average time it takes to load data from S3 into Redshift for all 200 tables.
Answer 0 (score: 3)
You can check the STL System Tables for Logging to find out how long your load queries took to run.
You will probably need to parse the query text to work out which tables were loaded, but you can use the historical load times to calculate a typical load time for each table.
Particularly useful tables for this include STL_QUERY, STL_LOAD_COMMITS and STL_S3CLIENT.
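As a sketch of that approach (this query is not part of the original answers), you can pull the COPY statements out of STL_QUERY, take the target table name as the second word of the statement, and average the duration per table. The split_part call assumes statements of the form COPY <schema.table> FROM ... separated by single spaces, so adjust the parsing to match how your COPY commands are actually written.

-- Rough sketch: average COPY duration per target table over the last 7 days,
-- with the table name parsed out of the query text.
select lower(split_part(trim(querytxt), ' ', 2))  as target_table,
       count(*)                                   as copy_count,
       avg(datediff(seconds, starttime, endtime)) as avg_load_seconds
from stl_query
where trim(querytxt) ilike 'copy%'
  and starttime >= dateadd(day, -7, current_date)
group by 1
order by avg_load_seconds desc;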
Answer 1 (score: 2)
Run this query to see how fast your COPY queries are working:
select q.starttime, s.query, substring(q.querytxt, 1, 120) as querytxt,
       s.n_files, s.size_mb, s.time_seconds,
       s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
from (select query, count(*) as n_files,
             sum(transfer_size / (1024 * 1024)) as size_mb,
             (max(end_time) - min(start_time)) / (1000000) as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET' and query > 0 and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, s.size_mb desc, s.end_time desc
limit 50;
Once you know how many MB/s you are pushing through from S3, you can roughly work out how long each file will take based on its size.
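To make that estimate concrete, here is a rough sketch (not from the original answer) that averages the per-query MB/s from the same STL_S3CLIENT data and applies it to a hypothetical 750 MB file; 750 is only an illustrative number.

-- Overall average COPY throughput and an estimated load time for a 750 MB file.
with rates as (
    select query,
           sum(transfer_size) / (1024.0 * 1024) as size_mb,
           (max(end_time) - min(start_time)) / 1000000.0 as time_seconds
    from stl_s3client
    where http_method = 'GET' and query > 0 and transfer_time > 0
    group by query
)
select avg(size_mb / time_seconds) as avg_mb_per_s,
       750 / avg(size_mb / time_seconds) as est_seconds_for_750mb_file
from rates
where time_seconds > 0;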
Answer 2 (score: 0)
There is a clever way to do this. You should have an ETL script that migrates the data from S3 to Redshift.
Assuming you have a shell script, just capture a timestamp before the ETL logic for that table starts (let's call it start),
capture another timestamp after the ETL logic for that table finishes (let's call it end), and take the difference at the end of the script:
#!/bin/sh
.
.
.
start=$(date +%s) # capture start time

# ETL logic
[find the right csv on S3]
[check for duplicates, whether the file has already been loaded etc]
[run your ETL logic, logging to make sure that file has been processed on S3]
[copy that table to Redshift, log again to make sure that table has been copied]
[error logging, trigger emails, SMS, slack alerts etc]
[ ... ]

end=$(date +%s) # capture end time
duration=$((end-start)) # difference (time taken by the script to execute)
echo "duration is $duration"

PS: The duration will be in seconds, and you can keep it in a log file, write it to a database table, and so on. The timestamps are in epoch format, so depending on where you log them you can use functions such as:

sec_to_time($duration) - for MySQL

- for Amazon Redshift, take the difference of the two epoch values directly.
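If you write those epoch timestamps into Redshift itself, a minimal sketch of the per-table average could look like the query below. The table etl_load_log and its columns (table_name, start_epoch, end_epoch) are hypothetical names for whatever your script logs; they do not come from the answer.

-- Hypothetical log table populated by the shell script above, e.g.:
-- create table etl_load_log (table_name varchar(256), start_epoch bigint, end_epoch bigint);
select table_name,
       avg(end_epoch - start_epoch) as avg_load_seconds,
       timestamp 'epoch' + max(end_epoch) * interval '1 second' as last_loaded_at
from etl_load_log
group by table_name
order by avg_load_seconds desc;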