I have more than 8 schemas and over 200 tables, and the data is loaded into the different schemas from CSV files.
I want to know a SQL script that finds the average time it takes to load data from S3 into Redshift for all 200 tables.
Answer 0 (score: 3)
You can check the STL System Tables for Logging to find out how long your load queries took to run.
You will probably need to parse the query text to work out which tables were loaded, but you can use the historical load times to calculate a typical load time for each table.
Particularly useful tables for this include STL_QUERY, STL_LOAD_COMMITS and STL_S3CLIENT.
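As a sketch of that approach (this query is not part of the original answers), you can pull the COPY statements out of STL_QUERY, take the target table name as the second word of the statement, and average the duration per table. The split_part call assumes statements of the form COPY <schema.table> FROM ... separated by single spaces, so adjust the parsing to match how your COPY commands are actually written.

-- Rough sketch: average COPY duration per target table over the last 7 days,
-- with the table name parsed out of the query text.
select lower(split_part(trim(querytxt), ' ', 2))  as target_table,
       count(*)                                   as copy_count,
       avg(datediff(seconds, starttime, endtime)) as avg_load_seconds
from stl_query
where trim(querytxt) ilike 'copy%'
  and starttime >= dateadd(day, -7, current_date)
group by 1
order by avg_load_seconds desc;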
Answer 1 (score: 2)
Run this query to see how fast your COPY queries are working:
select q.starttime, s.query, substring(q.querytxt, 1, 120) as querytxt,
       s.n_files, s.size_mb, s.time_seconds,
       s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
from (select query, count(*) as n_files,
             sum(transfer_size / (1024 * 1024)) as size_mb,
             (max(end_time) - min(start_time)) / (1000000) as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET' and query > 0 and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, s.size_mb desc, s.end_time desc
limit 50;
Once you know how many MB/s you are pushing through from S3, you can roughly work out how long each file will take based on its size.
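To make that estimate concrete, here is a rough sketch (not from the original answer) that averages the per-query MB/s from the same STL_S3CLIENT data and applies it to a hypothetical 750 MB file; 750 is only an illustrative number.

-- Overall average COPY throughput and an estimated load time for a 750 MB file.
with rates as (
    select query,
           sum(transfer_size) / (1024.0 * 1024) as size_mb,
           (max(end_time) - min(start_time)) / 1000000.0 as time_seconds
    from stl_s3client
    where http_method = 'GET' and query > 0 and transfer_time > 0
    group by query
)
select avg(size_mb / time_seconds) as avg_mb_per_s,
       750 / avg(size_mb / time_seconds) as est_seconds_for_750mb_file
from rates
where time_seconds > 0;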
Answer 2 (score: 0)
There is a clever way to do this. You should have an ETL script that migrates the data from S3 to Redshift.
Assuming you have a shell script, just capture a timestamp before the ETL logic for that table starts (let's call it start),
capture another timestamp after the ETL logic for that table finishes (let's call it end), and take the difference at the end of the script:
#!/bin/sh
.
.
.
start=$(date +%s) # capture start time

# ETL logic
[find the right csv on S3]
[check for duplicates, whether the file has already been loaded etc]
[run your ETL logic, logging to make sure that file has been processed on S3]
[copy that table to Redshift, log again to make sure that table has been copied]
[error logging, trigger emails, SMS, slack alerts etc]
[ ... ]

end=$(date +%s) # capture end time
duration=$((end-start)) # difference (time taken by the script to execute)
echo "duration is $duration"

PS: The duration will be in seconds, and you can keep it in a log file, write it to a database table, and so on. The timestamps are in epoch format, so depending on where you log them you can use functions such as:

sec_to_time($duration) - for MySQL

- for Amazon Redshift, take the difference of the two epoch values directly.
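If you write those epoch timestamps into Redshift itself, a minimal sketch of the per-table average could look like the query below. The table etl_load_log and its columns (table_name, start_epoch, end_epoch) are hypothetical names for whatever your script logs; they do not come from the answer.

-- Hypothetical log table populated by the shell script above, e.g.:
-- create table etl_load_log (table_name varchar(256), start_epoch bigint, end_epoch bigint);
select table_name,
       avg(end_epoch - start_epoch) as avg_load_seconds,
       timestamp 'epoch' + max(end_epoch) * interval '1 second' as last_loaded_at
from etl_load_log
group by table_name
order by avg_load_seconds desc;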