Extracting data from a website using SAS

Time: 2014-11-10 10:00:48

Tags: sas

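I want to extract data from a website that requires a login ID and password. Can anyone help me do this with SAS? I have tried many FILENAME URL examples, but none of them worked; they fail with the error message "ERROR: Hostname www.seer.cancer.gov not found". I have SAS EG (Base SAS). Please see the examples below: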

Example 1:

filename seercode URL "http://www.seer.cancer.gov/siterecode/icdo3_d01272003/index.txt";
data siterecode;
    infile seercode truncover;
    input @1 bigline $char256.;
run;

Example 2:

FILENAME SOURCE URL "%STR(http://www.usatoday.com)" DEBUG;
DATA SOURCE1;
    FORMAT WEBPAGE $1000.;
    INFILE SOURCE LRECL=32767 DELIMITER=">";
    INPUT WEBPAGE $ @@;
RUN;

3 Answers:

Answer 0 (score: 1)

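If your browser sits behind a proxy server, you may need the PROXY=proxyurl option. You can look up the proxy URL directly in your browser settings, or you can inspect the wpad script that many sites use to store which proxy server should be used for which destinations. That script can usually be found at this URL: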

http://wpad/wpad.dat
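
As a rough sketch of how that script could be inspected from SAS itself, the step below reads the wpad file and writes any line that mentions a proxy to the log. It assumes that http://wpad/wpad.dat resolves on your network and is reachable without a proxy from the machine where SAS actually runs (with SAS EG that may be a remote server). The fileref name `wpad` and the variable name `line` are only illustrative.

```sas
/* Sketch: list the proxy entries of the auto-config script in the log. */
/* Assumption: http://wpad/wpad.dat is reachable directly from the      */
/* machine running SAS. "wpad" is an arbitrary fileref name.            */
filename wpad url "http://wpad/wpad.dat";

data _null_;
    infile wpad truncover lrecl=32767;
    input line $char1000.;
    /* Keep only lines that mention a proxy, e.g. "PROXY host:port" */
    if find(line, 'PROXY', 'i') then put line;
run;

filename wpad clear;
```

The host and port printed this way are what would go into the PROXY= option of the FILENAME statement.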

Answer 1 (score: 0)

You can use the USER= option of the FILENAME statement; see http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000223242.htm

filename foo url 'https://www.b.com/file1.html' user='jones' prompt;
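
For illustration, here is a slightly fuller sketch that passes the credentials explicitly instead of using PROMPT and then reads the page. The URL and user name are the placeholders from the statement above, and the password value is hypothetical. As far as I know, USER= and PASS= send HTTP Basic authentication credentials, so a site that logs you in through an HTML form will probably not accept them.

```sas
/* Sketch: supply credentials with USER= and PASS= instead of PROMPT. */
/* The URL and user are placeholders from the example above; the      */
/* password is a hypothetical value.                                  */
filename foo url 'https://www.b.com/file1.html'
    user='jones' pass='my-password';

data page;
    infile foo truncover lrecl=32767;
    /* One observation per line of the returned page */
    input line $char1000.;
run;

filename foo clear;
```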

Answer 2 (score: 0)

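I ran the code for both examples and it worked fine (see the logs below). I suggest you try specifying the PROXY=, PUSER=, and PPASS= options. For example, in your first piece of code: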

filename seercode URL "http://www.seer.cancer.gov/siterecode/icdo3_d01272003/index.txt" proxy='http://proxy.com' puser='login' ppass='pws'; 

See more information here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000223242.htm

Here are the log files.

Example 1:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 42         ;
 43          filename seercode URL    "http://www.seer.cancer.gov/siterecode/icdo3_d01272003/index.txt";
 44         data siterecode;
 45         infile seercode truncover;
 46         input @1 bigline $char256.;
 47         run;

 NOTE: The infile SEERCODE is:
       Filename=http://www.seer.cancer.gov/siterecode/icdo3_d01272003/index.txt,
   Local Host Name=localhost.localdomain,
   Local Host IP addr=::1,
   Service Hostname Name=www.seer.cancer.gov,
   Service IP addr=63.236.108.164,
   Service Name=httpd,Service Portno=80,
   Lrecl=32767,Recfm=Variable

 NOTE: 114 records were read from the infile SEERCODE.
   The minimum record length was 0.
   The maximum record length was 222.
NOTE: The data set WORK.SITERECODE has 114 observations and 1 variables.
NOTE: DATA statement used (Total process time):
   real time           5.65 seconds
   cpu time            0.04 seconds


 48         ;
 49         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 59         ;

Example 2:

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 42         ;
 43         FILENAME SOURCE URL "%STR(http://www.usatoday.com)" DEBUG;
 44             DATA SOURCE1;
 45             FORMAT WEBPAGE $1000.;
 46             INFILE SOURCE LRECL=32767 DELIMITER=">";
 47             INPUT WEBPAGE $ @@;
 48             RUN;

 NOTE: >>> GET / HTTP/1.0
 NOTE: >>> Host: www.usatoday.com
 NOTE: >>> Accept: */*.
 NOTE: >>> Accept-Language: en
 NOTE: >>> Accept-Charset: iso-8859-1,*,utf-8
 NOTE: >>> User-Agent: SAS/URL
 NOTE: >>> 
 NOTE: <<< HTTP/1.0 200 OK
 NOTE: <<< Server: nginx/1.2.7
 NOTE: <<< Content-Type: text/html; charset=utf-8
 NOTE: <<< Content-Language: en
 NOTE: <<< Last-Modified: Mon, 10 Nov 2014 23:15:32 GMT
 NOTE: <<< X-Secret:     cnpudnkgcnpiZXZnbUBoZm5nYnFubC5wYnogbmFxIFYganZ5eSBnZWwgZ2IgdHJnIGxiaCBuIHdiby4=
 NOTE: <<< X-Gannett-Site-Version: 579.0
 NOTE: <<< X-UA-Compatible: IE=Edge,chrome=1
 NOTE: <<< Content-Length: 185753
 NOTE: <<< Cache-Control: max-age=60
 NOTE: <<< Expires: Mon, 10 Nov 2014 23:17:30 GMT
 NOTE: <<< Date: Mon, 10 Nov 2014 23:16:30 GMT
 NOTE: <<< Connection: close
 NOTE: <<< 
 NOTE: The infile SOURCE is:
   Filename=http://www.usatoday.com,
   Local Host Name=localhost.localdomain,
   Local Host IP addr=::1,
   Service Hostname Name=host-62-253-8-163.not-set-yet.virginmedia.net,
   Service IP addr=62.253.3.163,
   Service Name=httpd,Service Portno=80,
   Lrecl=32767,Recfm=Variable

 NOTE: 412 records were read from the infile SOURCE.
   The minimum record length was 0.
   The maximum record length was 32767.
   One or more lines were truncated.
 NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
 NOTE: The data set WORK.SOURCE1 has 3127 observations and 1 variables.
 NOTE: DATA statement used (Total process time):
   real time           0.31 seconds
   cpu time            0.05 seconds


 49         ;
 50         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 60         ;

Regards, Vasily