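In this case I need to get the exchange rate for several currency pairs. I have 2 tables: one with information about the bank's operations and another with the daily exchange rates the bank uses. I'm just starting to learn data analysis, so please bear with me. My English isn't very good either.
Consider the following example:
Table 1 (bank operations):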
Table 2 (exchange rates):
Op Number | Coin_1 | Coin_2 | Date | Hour 1 | Weekday |
1 | EUR | GBP | 2020/06/01 | 03:30 | Monday |
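Note: no exchange rates are published on weekends.
I don't know how to obtain that value. With a Script Component? If possible, could you help me with an algorithm? So far I have done all the ETL I needed, but I can't find a way to solve this task.
Answer 0 (score: 2)
This can be done in SQL using the lead window function and some datetime math.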
create table #t1(
[Case] int,
[Op Number] int,
[Coin_1] varchar(10),
[Coin_2] varchar(10),
[Date] date,
[Hour 1] time,
[Weekday] varchar(10)
)
insert into #t1 values
( 1, 1, 'EUR', 'GBP', '2020/06/01', '03:30', 'Monday')
create table #t2(
[Case] int,
[Coin_1] varchar(10),
[Coin_2] varchar(10),
[Date] date,
[Hour 2] time,
[Weekday] varchar(10),
[Rate] decimal(10,2)
)
insert into #t2 values
( 1, 'EUR', 'GBP', '2020/03/01', '11:30', 'Friday', 0.6),
( 1, 'EUR', 'GBP', '2020/03/01', '18:30', 'Friday', 0.5 ),
( 1, 'EUR', 'GBP', '2020/06/01', '12:30', 'Monday', 0.55)
; with t1 as (
select *, dt = CAST(CONCAT([Date], ' ', [hour 1]) AS datetime2(0))
from #t1
)
, x as (
select *, dt = CAST(CONCAT([Date], ' ', [hour 2]) AS datetime2(0))
from #t2
)
, t2 as (
select [Case],
[Coin_1],
[Coin_2],
[Rate],
[Date],
[Hour 2],
[Weekday],
dt as start_dt,
isnull(lead(dt) over(partition by [case] order by dt asc), '20990101') end_dt
from x
)
select *
from t1
inner join t2 on t2.[case] = t1.[case]
and t1.dt >= t2.start_dt
and t1.dt < t2.end_dt
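The same interval idea can be sketched outside SQL. The following is a minimal Python illustration, not part of the original answer: each rate is treated as valid from its own timestamp until the next rate's timestamp (what lead() computes), and an operation picks the rate whose interval contains it. The names build_intervals and rate_for are hypothetical, and the sample rows simply mirror the values inserted into #t2.

```python
from datetime import datetime

FAR_FUTURE = datetime(2099, 1, 1)  # same sentinel as '20990101' in the query

def build_intervals(rates):
    """rates: list of (start_dt, rate) tuples.
    Returns (start_dt, end_dt, rate) rows where end_dt is the next row's
    start_dt, mimicking lead(dt) over (order by dt)."""
    ordered = sorted(rates)
    intervals = []
    for i, (start_dt, rate) in enumerate(ordered):
        end_dt = ordered[i + 1][0] if i + 1 < len(ordered) else FAR_FUTURE
        intervals.append((start_dt, end_dt, rate))
    return intervals

def rate_for(op_dt, intervals):
    """Return the rate whose half-open interval [start_dt, end_dt) contains op_dt."""
    for start_dt, end_dt, rate in intervals:
        if start_dt <= op_dt < end_dt:
            return rate
    return None  # the operation is earlier than the first known rate

rates = [
    (datetime(2020, 3, 1, 11, 30), 0.6),
    (datetime(2020, 3, 1, 18, 30), 0.5),
    (datetime(2020, 6, 1, 12, 30), 0.55),
]
intervals = build_intervals(rates)
print(rate_for(datetime(2020, 6, 1, 3, 30), intervals))  # 0.5
```

The half-open interval matches the join condition t1.dt >= t2.start_dt and t1.dt < t2.end_dt, so an operation that falls exactly on a rate's timestamp takes that rate and not the previous one.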
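Answer 1 (score: 1)
If this is a learning exercise, go ahead and make full use of the SSIS components. If this is real-world work, trust my experience: trying to do this with SSIS is not going to be pleasant.
One of the biggest challenges with the existing data model is that the date and the time are stored separately. I assume the source system stores them as date and time(0) data types. In my query I create an actual datetime2 column so that I can let Microsoft's senior engineers worry about the correct comparison logic.
I see this as an OUTER APPLY with TOP 1 problem, rather than the lead/lag solution Steve proposed.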
CREATE TABLE dbo.BankOperations
(
CaseNumber int
, Coin_1 char(3)
, Coin_2 char(3)
, TransactionDate date
, TransactionTime time(0)
);
CREATE TABLE dbo.ExchangeRates
(
CaseNumber int
, Coin_1 char(3)
, Coin_2 char(3)
, TransactionDate date
, TransactionTime time(0)
, Rate decimal(4, 2)
);
INSERT INTO
dbo.BankOperations
VALUES
(
1, 'EUR', 'GBP', '2020-06-01', '03:30'
)
-- boundary checking exact
,( 2, 'EUR', 'GBP', '2020-06-01', '12:30')
-- boundary beyond/not defined
,( 3, 'EUR', 'GBP', '2020-06-01', '13:30')
-- boundary before
,( 4, 'EUR', 'GBP', '2020-03-01', '10:30')
-- boundary first at
,( 5, 'EUR', 'GBP', '2020-03-01', '11:30')
INSERT INTO
dbo.ExchangeRates
VALUES
(
1, 'EUR', 'GBP', '2020-03-01', '11:30', .6
)
, (
2, 'EUR', 'GBP', '2020-03-01', '18:30', .5
)
, (
3, 'EUR', 'GBP', '2020-06-01', '12:30', .55
);
-- Creating a temp table version of the above as the separate date and time fields will
-- crush performance at scale (so too might duplicating data as we're about to do)
SELECT
X.*
, CAST(CONCAT(X.TransactionDate, 'T', X.TransactionTime) AS datetime2(0)) AS IsThisWorking
INTO
#BankOperations
FROM
dbo.BankOperations AS X;
SELECT
X.*
, CAST(CONCAT(X.TransactionDate, 'T', X.TransactionTime) AS datetime2(0)) AS IsThisWorking
INTO
#ExchangeRates
FROM
dbo.ExchangeRates AS X;
-- Option A for pinning data
-- Outer apply will use the TOP 1 to get the closest without going over
SELECT
BO.*
-- assuming surrogate key
, EX.CaseNumber
, EX.Rate
FROM
#BankOperations AS BO
OUTER APPLY
(
SELECT TOP 1 *
FROM
#ExchangeRates AS ER
WHERE
-- Match based on all of our keys
ER.Coin_1 = BO.Coin_1
AND ER.Coin_2 = BO.Coin_2
-- Eliminate what's too new
AND BO.IsThisWorking >= ER.IsThisWorking
ORDER BY
ER.IsThisWorking DESC
)EX
;
-- Option B
-- Use lead/lag function to get the value
-- but my brain isn't seeing it at the moment
/*
SELECT
BO.*
-- assuming surrogate key
, LAG()
FROM
#BankOperations AS BO
INNER JOIn #ExchangeRates
*/
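To make the "closest without going over" rule concrete, here is a minimal Python sketch under stated assumptions; it is not the answer's own code. It groups rates by currency pair, sorts them by timestamp, and uses a binary search to find the last rate at or before each operation. The helper names build_index and latest_rate are hypothetical, and the data repeats the five boundary cases loaded into dbo.BankOperations.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime

def build_index(rates):
    """Group (coin_1, coin_2, dt, rate) rows by pair, sorted by timestamp."""
    index = defaultdict(list)
    for coin_1, coin_2, dt, rate in rates:
        index[(coin_1, coin_2)].append((dt, rate))
    for rows in index.values():
        rows.sort()
    return index

def latest_rate(index, coin_1, coin_2, op_dt):
    """Return the rate of the last row with dt <= op_dt, or None if none exists."""
    rows = index.get((coin_1, coin_2), [])
    # Rebuilding the key list per call keeps the sketch short; cache it at scale.
    pos = bisect_right([dt for dt, _ in rows], op_dt)
    return rows[pos - 1][1] if pos else None

index = build_index([
    ("EUR", "GBP", datetime(2020, 3, 1, 11, 30), 0.6),
    ("EUR", "GBP", datetime(2020, 3, 1, 18, 30), 0.5),
    ("EUR", "GBP", datetime(2020, 6, 1, 12, 30), 0.55),
])
operations = {
    1: datetime(2020, 6, 1, 3, 30),
    2: datetime(2020, 6, 1, 12, 30),   # exactly on a rate timestamp
    3: datetime(2020, 6, 1, 13, 30),   # after the last rate
    4: datetime(2020, 3, 1, 10, 30),   # before the first rate
    5: datetime(2020, 3, 1, 11, 30),   # exactly on the first rate
}
results = {case: latest_rate(index, "EUR", "GBP", dt) for case, dt in operations.items()}
print(results)  # {1: 0.5, 2: 0.55, 3: 0.55, 4: None, 5: 0.6}
```

Case 4 returns None for the same reason OUTER APPLY returns NULL columns: no rate exists at or before that operation, and the row is still kept.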
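If I were forced to provide a purely SSIS-based answer, I would use a Lookup component, and instead of the default Full cache I would run it in No cache mode. The performance impact is that for every row entering the buffer, we fire a query at the source system to retrieve one row of data. Depending on volume, this can be "heavy".
As the source, you have an OLE DB Source component pointing at BankOperations. It feeds a Lookup, which we will parameterize.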
SELECT TOP 1 *
FROM
dbo.ExchangeRates AS ER
CROSS APPLY (SELECT CAST(CONCAT(ER.TransactionDate, 'T', ER.TransactionTime) AS datetime2(0)) AS IsThisWorking) ITW
WHERE
-- Match based on all of our keys
ER.Coin_1 = ?
AND ER.Coin_2 = ?
-- Eliminate what's too new
AND CAST(CONCAT(?, 'T', ?) AS datetime2(0)) >= ITW.IsThisWorking
ORDER BY
ITW.IsThisWorking DESC
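Each ? is an ordinal, zero-based placeholder, so the order in which you map them matters. What we're doing is mimicking the logic of the original query. Full disclosure: it has been many years since I built a parameterized No/Partial cache lookup, so you'll have to read up on some of the finer points. What I remember is that you have to click into the Advanced "stuff" to make it work.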
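The per-row, positional-parameter pattern can be tried outside SSIS. The sketch below uses Python's built-in sqlite3 module as a stand-in for the OLE DB connection, which is an assumption made only so the example runs anywhere: SQLite uses LIMIT 1 where T-SQL uses TOP 1, and the lowercase table and column names are hypothetical. Dates and times are stored as ISO text so that string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exchange_rates "
    "(coin_1 TEXT, coin_2 TEXT, rate_date TEXT, rate_time TEXT, rate REAL)"
)
conn.executemany(
    "INSERT INTO exchange_rates VALUES (?, ?, ?, ?, ?)",
    [
        ("EUR", "GBP", "2020-03-01", "11:30", 0.6),
        ("EUR", "GBP", "2020-03-01", "18:30", 0.5),
        ("EUR", "GBP", "2020-06-01", "12:30", 0.55),
    ],
)

# Four positional placeholders, bound strictly in order:
# coin_1, coin_2, operation date, operation time.
LOOKUP_SQL = """
SELECT rate
FROM exchange_rates
WHERE coin_1 = ?
  AND coin_2 = ?
  AND (? || 'T' || ?) >= (rate_date || 'T' || rate_time)
ORDER BY rate_date DESC, rate_time DESC
LIMIT 1
"""

def lookup_rate(coin_1, coin_2, op_date, op_time):
    """Run one query per incoming row, as a No cache lookup would."""
    row = conn.execute(LOOKUP_SQL, (coin_1, coin_2, op_date, op_time)).fetchone()
    return row[0] if row else None

print(lookup_rate("EUR", "GBP", "2020-06-01", "03:30"))  # 0.5
print(lookup_rate("EUR", "GBP", "2020-03-01", "10:30"))  # None
```

Swapping two arguments in the tuple silently changes the result, which is the practical meaning of the placeholders being order-specific.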
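The other approach I can see using SSIS components involves two sources and a join. I think it was Matt Masson who demonstrated this technique, but it has been years since I had to do it. Again, performance will be better if you do this in the source query, because this approach requires two sorts plus a join, which are blocking transformations.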
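As a rough illustration of why the sort-then-join route needs both inputs ordered first, here is a minimal Python sketch of a merge-style pass over two sorted lists. It shows only the general idea and makes no claim about how the SSIS Merge Join component is configured; the function name merge_asof and the tuple layouts are hypothetical.

```python
from datetime import datetime

def merge_asof(operations, rates):
    """operations: (coin_1, coin_2, dt, case) rows; rates: (coin_1, coin_2, dt, rate) rows.
    Returns (case, rate) pairs, pairing each operation with the latest rate of the
    same currency pair at or before it, in one forward pass over both inputs."""
    ops = sorted(operations)   # both sorts must finish before any row can flow on
    rts = sorted(rates)
    out = []
    i = 0
    current = None  # (pair, rate) of the most recent rate row consumed so far
    for coin_1, coin_2, op_dt, case in ops:
        # Consume every rate whose (pair, dt) key is not after this operation.
        while i < len(rts) and rts[i][:3] <= (coin_1, coin_2, op_dt):
            current = (rts[i][:2], rts[i][3])
            i += 1
        pair = (coin_1, coin_2)
        rate = current[1] if current and current[0] == pair else None
        out.append((case, rate))
    return out

rates = [
    ("EUR", "GBP", datetime(2020, 3, 1, 11, 30), 0.6),
    ("EUR", "GBP", datetime(2020, 3, 1, 18, 30), 0.5),
    ("EUR", "GBP", datetime(2020, 6, 1, 12, 30), 0.55),
]
operations = [
    ("EUR", "GBP", datetime(2020, 6, 1, 3, 30), 1),
    ("EUR", "GBP", datetime(2020, 6, 1, 12, 30), 2),
    ("EUR", "GBP", datetime(2020, 6, 1, 13, 30), 3),
    ("EUR", "GBP", datetime(2020, 3, 1, 10, 30), 4),
    ("EUR", "GBP", datetime(2020, 3, 1, 11, 30), 5),
]
print(sorted(merge_asof(operations, rates)))
```

Because both lists are sorted on the same key, the rate pointer only ever moves forward, which is what makes the single pass possible and also why nothing can be emitted until the sorts complete.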
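The best Script Component approach would mimic the parameterized Lookup component approach. It stays synchronous (1 row in, 1 row out), and we enrich the data flow by adding a Rate column.
Pseudocode, approximately: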
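// make local variables with values from the row buffer
var coin_1 = Row.coin1;
var coin_2 = Row.coin2;
var transactionDate = Row.IsThisWorking;
// standard ADO.NET parameterized query stuff here (needs: using System.Data.SqlClient;)
// connectionString is a placeholder; supply your own
using (SqlConnection conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (SqlCommand command = conn.CreateCommand())
    {
        // ORDER BY ... DESC is what makes TOP 1 return the most recent rate that is not too new
        command.CommandText = "SELECT TOP 1 ER.Rate FROM dbo.ExchangeRates AS ER CROSS APPLY (SELECT CAST(CONCAT(ER.TransactionDate, 'T', ER.TransactionTime) AS datetime2(0)) AS IsThisWorking) ITW WHERE @txnDate >= ITW.IsThisWorking AND ER.Coin_1 = @coin1 AND ER.Coin_2 = @coin2 ORDER BY ITW.IsThisWorking DESC;";
        // AddWithValue infers the SQL type from the .NET value
        command.Parameters.AddWithValue("@txnDate", transactionDate);
        command.Parameters.AddWithValue("@coin1", coin_1);
        command.Parameters.AddWithValue("@coin2", coin_2);
        // ExecuteScalar returns null when no rate exists at or before the transaction
        object rate = command.ExecuteScalar();
        // Row.Rate assumes a Rate output column has been added to the component
        if (rate != null && rate != DBNull.Value) Row.Rate = (decimal)rate;
    }
}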