How to find the average time to load data from S3 into Redshift

Date: 2017-11-25 21:31:48

Tags: amazon-web-services amazon-s3 amazon-redshift

I have more than 8 schemas and 200+ tables, and the data is loaded from CSV files across the different schemas.

I would like a SQL script that finds the average time taken to load the data from S3 into Redshift for all 200 tables.

3 Answers:

Answer 0: (score: 3)

You can check the STL System Tables for Logging to find out how long queries took to run.

You would probably need to parse the query text to discover which tables were loaded, but you can use the historical load times to calculate a typical load time for each table (see the sketch after the list below).

Some particularly useful tables are:

  • STL_QUERY_METRICS: Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for queries that have completed running in user-defined query queues (service classes).
  • STL_QUERY: Returns execution information about a database query.
  • STL_LOAD_COMMITS: This table records the progress of each data file as it is loaded into a database table.
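
As a rough illustration of that approach, here is a minimal sketch that joins STL_QUERY to STL_LOAD_COMMITS to average load durations per S3 file. Matching COPY statements by query text and grouping by filename are assumptions on my part, since mapping files back to your 200 tables depends on your naming convention:

-- Sketch: average COPY duration per loaded S3 file (assumptions noted above).
-- Note: the STL tables only retain a few days of history, so persist the
-- results somewhere if you need long-term averages.
select trim(c.filename)                              as s3_file,
       count(distinct q.query)                       as n_loads,
       avg(datediff(second, q.starttime, q.endtime)) as avg_load_seconds
from stl_query q
join stl_load_commits c on c.query = q.query
where q.querytxt ilike 'copy%'
group by trim(c.filename)
order by avg_load_seconds desc;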

Answer 1: (score: 2)

Run this query to find out how fast your COPY queries are working.

select q.starttime, s.query, substring(q.querytxt, 1, 120) as querytxt,
       s.n_files, s.size_mb, s.time_seconds,
       s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
from (select query, count(*) as n_files,
             sum(transfer_size / (1024 * 1024)) as size_mb,
             (max(end_time) - min(start_time)) / 1000000 as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET' and query > 0 and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, s.size_mb desc, s.end_time desc
limit 50;

Once you know how many MB/s you are pushing through from S3, you can roughly estimate how long each file will take based on its size.
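
As a purely hypothetical back-of-the-envelope example: if the query above reports a sustained 25 MB/s, a 2,048 MB file should take roughly 82 seconds.

-- hypothetical throughput and file size, for illustration only
select 2048 / 25.0 as estimated_seconds; -- ~81.9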

Answer 2: (score: 0)

There is a smarter way to do this. You should have an ETL script that migrates the data from S3 to Redshift.

Assuming you have a shell script, just capture a timestamp before the ETL logic starts for that table (let's call it start), capture another timestamp after the ETL logic ends for that table (let's call it end), and take the difference towards the end of the script:

#!/bin/sh
.
.
.
start=$(date +%s) # capture start time

# ETL logic
[find the right csv on S3]
[check for duplicates, whether the file has already been loaded etc]
[run your ETL logic, logging to make sure that file has been processed on S3]
[copy that table to Redshift, log again to make sure that table has been copied]
[error logging, trigger emails, SMS, slack alerts etc]
[ ... ]

end=$(date +%s) # capture end time
duration=$((end-start)) # difference (time taken by the script to execute)
echo "duration is $duration"

PS: The duration will be in seconds, and you can maintain a log file, write entries to a database table, and so on. The timestamps will be in epoch, and depending on where you log you can use functions like:

sec_to_time($duration) - for MySQL

an epoch-to-timestamp conversion - for Amazon Redshift (then take the difference between the two epoch instances; see the example below).
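
For the Redshift side, a common idiom for turning a logged epoch value back into a timestamp looks like the following; the epoch literal is just a placeholder for one of your logged values:

-- convert a logged epoch value (placeholder) back into a timestamp
select timestamp 'epoch' + 1511680982 * interval '1 second' as load_time;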