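I am using SAS on a large dataset (> 20 GB). When I run a DATA step, I get "BY variables are not properly sorted ..." even though I sorted the dataset by the same variables. When I run PROC SORT again, SAS even says "Input data set is already sorted, no sorting done". My code is: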
proc sort data=output.TAQ;
by market ric date miliseconds descending type order;
run;
options nomprint;
data markers (keep=market ric date miliseconds type order);
set output.TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;
The error message is:
ERROR: BY variables are not properly sorted on data set OUTPUT.TAQ.
RIC=CXR.CCP Date=20160914 Time=13:47:18.125 Type=Quote Price=. Volume=. BidPrice=9.03 BidSize=400
AskPrice=9.04 AskSize=100 Qualifiers= order=116458952 Miliseconds=49638125 exchange=CCP market=1
FIRST.market=0 LAST.market=0 FIRST.RIC=0 LAST.RIC=0 FIRST.Date=0 LAST.Date=1 i=. _ERROR_=1
_N_=43297873
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 43297874 observations read from the data set OUTPUT.TAQ.
WARNING: The data set WORK.MARKERS may be incomplete. When this step was stopped there were
56770826 observations and 6 variables.
WARNING: Data set WORK.MARKERS was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 1:14.21
cpu time 26.71 seconds
Answer 0 (score: 1)
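The error occurs deep into the DATA step, at _N_=43297873. That tells me PROC SORT worked up to a certain point and then failed. Without knowing more about your SAS environment or how OUTPUT.TAQ is stored, it is hard to say what the cause is.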
Some people have reported resource problems or file system limits when sorting large datasets.
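From SAS FAQ: Sorting Very Large Datasets with SAS (an unofficial source):
When sorting in the WORK folder, you need free storage equal to 4 times the size of the dataset (5 times under Unix)
You may be running out of memory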
You can use the options MSGLEVEL=i and FULLSTIMER to get a fuller picture of what the sort is doing. Likewise, options sastraceloc=saslog; can produce useful information.
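A minimal sketch of how those diagnostic options might be applied to the sort from the question. The dataset and BY variables are taken from the question; whether the extra log output points to the cause is not guaranteed. As far as I know, SASTRACELOC only matters when the library is a DBMS accessed through SAS/ACCESS, so it is left out here.

```sas
/* Ask SAS for extra informational notes and detailed resource statistics */
options msglevel=i fullstimer;

/* Re-run the sort from the question and inspect the log afterwards */
proc sort data=output.TAQ;
  by market ric date miliseconds descending type order;
run;
```

With FULLSTIMER on, the log reports memory and I/O figures for the step, which can help judge whether the sort is running short of resources.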
Perhaps, instead of sorting it, you could break the work into several steps, for example:
/* Get your market ~ ric ~ date pairs */
proc sql;
create table market_ric_date as
select distinct market, ric, date
from output.TAQ
/* Possibly an order by clause here on market, ric, date */
; quit;
data millisecond_stuff;
set market_ric_date;
*Possibly add type/order in this step as well?;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;
run;
/* Possibly a third step here to add type / order if you need to get from original data source */
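The optional third step mentioned in the last comment could look roughly like this, assuming the marker values from the commented-out code in the question (Type="AMARK", Order=0) are what you want. The output name markers and the KEEP list are copied from the question; if the markers are later combined with OUTPUT.TAQ, the length of Type would need to match the original variable, which is not known here.

```sas
/* Sketch only: tag each 1-second marker row created in millisecond_stuff */
data markers (keep=market ric date miliseconds type order);
  set millisecond_stuff;
  Type = "AMARK";  /* marker label taken from the question's commented-out code */
  Order = 0;       /* zero so that markers sort before trades and quotes at the same millisecond */
run;
```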
Answer 1 (score: 0)
If the source dataset resides in a database, it may have been sorted there using a different collation.
Try the following before sorting:
options sortpgm=sas;
Answer 2 (score: 0)
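I ran into the same error. The solution was to make a copy of the original table in the WORK library and sort the copy; after that, the BY statement worked.
In your case, it would look like this: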
data tmp_TAQ;
set output.TAQ;
run;
proc sort data=tmp_TAQ;
by market ric date miliseconds descending type order;
run;
data markers (keep=market ric date miliseconds type order);
set tmp_TAQ;
by market ric date;
if first.date;
* ie do the following once per stock-day;
* Make 1-second markers;
/*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;
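A possible alternative to copying the table, offered only as a sketch: the "already sorted" note suggests that PROC SORT is trusting the sort indicator stored with the dataset rather than re-reading the rows. PROC CONTENTS shows that stored sort information, and the FORCE option on PROC SORT asks for a redundant sort even when the indicator matches. Whether this resolves the error for OUTPUT.TAQ is an assumption; it has not been tested against the asker's data.

```sas
/* Inspect the sort information stored in the dataset header */
proc contents data=output.TAQ;
run;

/* Sort again even if the sort indicator says the data is already in order */
proc sort data=output.TAQ force;
  by market ric date miliseconds descending type order;
run;
```

Note that FORCE also allows an indexed dataset to be sorted in place, which destroys user-created indexes on it, so check for indexes in the PROC CONTENTS output first.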