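I have two tables. Table A contains daily information on corporate bond transactions from 2004 to 2012, and Table B contains bond rating information as of specific dates. I need to join the two tables so that each transaction in Table A has the most recent rating for that bond attached to it.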
Table A: daily_transactions
--------------------------------------------
DATE |BOND |PRICE
--------------------------------------------
20110401 |AES |100
20110402 |AES |101
20110403 |AES |102
20110404 |AES |103
20110401 |BPP |99
20110402 |BPP |98
Table B: bond_ratings
--------------------------------------------
DATE |BOND |RATING
--------------------------------------------
20110401 |AES |AAA
20110403 |AES |BB
20110401 |BPP |CCC
Table C: joined_data
--------------------------------------------
DATE |BOND |PRICE |RATING
--------------------------------------------
20110401 |AES |100 |AAA
20110402 |AES |101 |AAA
20110403 |AES |102 |BB
20110404 |AES |103 |BB
20110401 |BPP |99 |CCC
20110402 |BPP |98 |CCC
I have approx. 1,000,000 records in Table A and 14,000 records in Table B.
Update
What I have so far is:
create table test_merge as
SELECT a.date, b.date, a.bond, a.price, b.rating
FROM daily_transactions a
LEFT JOIN bond_ratings b ON a.bond = b.bond AND b.date <= a.date
WHERE NOT EXISTS (
SELECT 1 FROM bond_ratings b1
WHERE b1.bond = a.bond
AND b1.date <= a.date
AND b1.date > b.date
);
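It seems to work fine (http://sqlfiddle.com/#!3/d287f/2), but with the amount of data I have it runs very slowly: it takes about 2 hours. Is there a way to optimize it so that it runs faster?
I am very (very) new to SQL, so any help is greatly appreciated!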
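Answer 0 (score: 1)
For a more SAS-based approach (rather than SQL), you could build a SAS format from Table B, which may speed things up. A format in SAS is just a lookup table that maps anything between START and END to LABEL. For example, load this table as a format: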
fmtname | START | END | LABEL
-----------------------------------------------------------
$bondRate | AES20110401 | AES20110402 | AAA
This maps any text string between START and END to LABEL. So AES20110402 -> AAA.
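To make the range lookup concrete, here is a minimal sketch that defines the same kind of format by hand with a VALUE statement and looks up one key. The two ranges are taken from the sample ratings for AES in the question, and the 20991231 upper bound is only an illustrative far-future end date.

```sas
/* Hand-written version of the lookup, for illustration only */
proc format;
  value $bondRate
    'AES20110401' - 'AES20110402' = 'AAA'
    'AES20110403' - 'AES20991231' = 'BB';
run;

/* Look up a single bond/date key and write the result to the log */
data _null_;
  rating = put('AES20110402', $bondRate.);
  put rating=;
run;
```

The CNTLIN approach below builds the same ranges from the ratings table instead of typing them in.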
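Here is the full code, using Table B above (assuming DATE is a numeric SAS date value; if it is stored as text, use input(DATE, yymmdd8.) to convert it to numeric):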
/* Sort so that the most recent rating for each bond comes first */
PROC SORT DATA = TABLE_B;
by bond descending date;
run;
/* Use the LAG function to get the start and end date on one line */
data bond_ratings_fmt;
set TABLE_B;
by bond descending date;
START_DT = put(date, yymmddn8.); /* character date like 20110401 */
END_DT = put(lag(date) - 1, yymmddn8.); /* one day before the next rating starts */
/* first.bond is the most recent rating for each bond, */
/* so set END_DT to some far-future date in that case */
if first.bond then END_DT = '20991231';
START = cats(BOND, START_DT); /* CATS concatenates and trims spaces, e.g. AES20110401 */
END = cats(BOND, END_DT);
LABEL = Rating;
fmtName = '$bondRate';
run;
/* Load the format, using CNTLIN (control table in) */
proc format cntlin=bond_ratings_fmt;
run;
/* Apply the format */
data TableC_withRating (drop=_:);
set TABLE_A;
_DateChar = put(DATE, yymmddn8.);
Rating = put(cats(BOND, _DateChar), $bondRate.);
run;
You can get more out of this by adding an OTHER case to the format. There are plenty of good examples of cntlin and proc format online.
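As a rough sketch of the OTHER case: a CNTLIN row with HLO='O' acts as the catch-all range. Without one, a key that matches no range (for example, a transaction dated before the bond's first rating) is generally returned unchanged by PUT rather than as a missing value. The data set name bond_ratings_fmt_other and the label 'NR' are hypothetical placeholders, not part of the code above.

```sas
/* Append a catch-all row to the control data set built earlier */
data bond_ratings_fmt_other;
  set bond_ratings_fmt end=_last;
  output;                 /* keep every existing range row */
  if _last then do;
    HLO = 'O';            /* O marks this row as the OTHER range */
    LABEL = 'NR';         /* hypothetical label for "no rating found" */
    output;
  end;
run;

/* Reload the format from the extended control data set */
proc format cntlin=bond_ratings_fmt_other;
run;
```

With this in place, the final data step can stay as it is, and unmatched transactions would get the placeholder label instead of an echoed key.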
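Answer 1 (score: 0)
I suspect that in your case the subquery is what kills performance.
The approach below avoids the subquery, which makes the join more efficient.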
/*sample data:*/
DATA daily_transactions;
infile datalines dsd delimiter = '|';
informat date yymmdd8.;
format date yymmddn8.;
input date bond $ price;
datalines;
20110401|AES|100
20110402|AES|101
20110403|AES|102
20110404|AES|103
20110401|BPP|99
20110402|BPP|98
;
run;
DATA bond_ratings;
infile datalines dsd delimiter = '|';
informat date yymmdd8.;
format date yymmddn8.;
input date bond $ rating $;
datalines;
20110401|AES |AAA
20110403|AES |BB
20110401|BPP |CCC
;
run;
/*Modify the bond_ratings dataset such that for each record we can specify up till when that rating is valid*/
/*essentially we will have two date fields (from_date, to_date)
from_date bond rating to_date
20110401 AES AAA 20110402
20110403 AES BB .
20110401 BPP CCC .
*/
/*since there is no LEAD function in SAS, we sort in descending order by date and apply the LAG function, in effect getting the leading value*/
PROC SORT DATA = bond_ratings OUT = bond_ratings_sorted;
by bond descending date;
run;
/*capture the to_date by using lag function on the date.*/
data bond_ratings_lookup(rename = (date=from_date));
set bond_ratings_sorted;
by bond descending date;
format to_date yymmddn8.;
lag_date = lag(date); /*note: LAG is called outside the if-else below because it only updates its queue when it executes, so it must run on every row*/
if first.bond then to_date = .;
else to_date = lag_date - 1; /*-1, so that to_date is one day before the next rating date for this bond*/
drop lag_date;
run;
/*this sort is not necessary, but it is useful if you just want to verify the output*/
proc sort data = bond_ratings_lookup out = bond_ratings_lookup_sorted;
by bond from_date;
run;
/*final query:*/
proc sql;
create table joined as
select a.*, b.rating, b.from_date as bond_rating_start_period, b.to_date as bond_rating_end_period
from daily_transactions as a
left join bond_ratings_lookup_sorted as b
on a.bond = b.bond and
(
b.to_date ne . and (a.date >=b.from_date and a.date<= b.to_date )
or
b.to_date = . and (a.date >=b.from_date )
)
order by a.bond, a.date, b.from_date
;
quit;
Answer 2 (score: 0)
I managed to cut the run time down to 5 minutes by building an index on the bond column.
proc sql;
create index bond
on work.daily_transactions(bond);
quit;
proc sql;
create index bond
on work.bond_ratings(bond);
quit;