Problem statement:
I am trying to read the following file in SAS (the file contents are shown below):
MicrosoftBillGates1976
AppleSteaveJob1975
GoogleLarryPage2004
FacebookMarkZukerberg2004
TwitterBizStone2006
I tried the following code:
DATA CN;
INFILE 'W:\NMIMS\Sem 1\SAS\Datasets\CN.txt';
length Founder $10;
INPUT Name $1-9 Founder$10-23 Founded $24-29 ;
RUN;
PROC PRINT DATA = CN;
RUN;
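But no luck so far. Can someone help me, and give some explanation?
Answer 0: (score: 2)
You can use regular expression matching to detect the start of each capitalised word: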
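data want;
  input @;                          /* load the line into _infile_ and hold it */
  array c[3] $16 Company Firstname Lastname;
  retain regex;
  /* compile once: a capital letter followed by one or more lowercase letters */
  if _n_ = 1 then regex = prxparse('/[A-Z][a-z]+/');
  start = 1;
  stop = length(_infile_);
  do i = 1 to 3;
    /* find the next match; position and length describe it */
    call prxnext(regex, start, stop, _infile_, position, length);
    c[i] = substr(_infile_, position, length);
  end;
  /* whatever follows the third word is the year */
  Year = input(substr(_infile_, position + length), 8.);
  input;                            /* release the held line */
  keep Company Firstname Lastname Year;
cards;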
MicrosoftBillGates1976
AppleSteaveJob1975
GoogleLarryPage2004
FacebookMarkZukerberg2004
TwitterBizStone2006
;
run;
If your data source is not consistently formatted, it may be easier to pay someone to transcribe it manually into separate fields.
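Before deciding whether manual transcription is needed, it can help to see how many lines actually break the expected layout. The following is only a minimal sketch under stated assumptions: the data set names good and bad, the variable line, and the pattern (three capitalised words followed by a four-digit year) are illustrative, and the line IBM1911 is a made-up malformed record, not part of the original file.

```sas
/* Sketch: route lines that do not fit the assumed layout to a separate data set. */
/* Data set names (good, bad) and the pattern are illustrative assumptions.       */
data good bad;
  input;                          /* load the raw line into _infile_ */
  length line $80;
  retain rx;
  /* assumed layout: three capitalised words followed by a 4-digit year */
  if _n_ = 1 then rx = prxparse('/^([A-Z][a-z]+){3}\d{4}\s*$/');
  line = _infile_;
  if prxmatch(rx, _infile_) then output good;
  else output bad;
  drop rx;
datalines;
MicrosoftBillGates1976
AppleSteaveJob1975
IBM1911
;
run;
```

With these three lines, the first two land in good and the hypothetical IBM1911 line lands in bad, because it has no lowercase letters after its capitals.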
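Answer 1: (score: 0)
Ultimately, you will need to create a macro that parses each line using ANYDIGIT() and ANYUPPER().
A similar question was asked and answered here: https://communities.sas.com/t5/Base-SAS-Programming/Split-the-string-into-two-parts/td-p/133091
I hope this helps!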
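To illustrate how ANYUPPER() and ANYDIGIT() can locate the field boundaries, here is a minimal sketch on a single hard-coded line. It uses a plain DATA step rather than a macro, which is an assumption of this sketch and not something the answer specifies; the variable names line, p, d, company, person and year are likewise hypothetical.

```sas
/* Sketch: split one line with ANYUPPER and ANYDIGIT in a DATA step. */
data _null_;
  length line $40 company person $20;
  line = 'GoogleLarryPage2004';
  p = anyupper(line, 2);          /* first uppercase letter at or after position 2 */
  d = anydigit(line);             /* position of the first digit */
  company = substr(line, 1, p - 1);
  person  = substr(line, p, d - p);
  year    = input(substr(line, d, 4), 4.);
  put company= person= year=;     /* writes the three values to the log */
run;
```

Starting the ANYUPPER search at position 2 skips the capital letter that begins the company name, so p marks the start of the person's name.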
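Answer 2: (score: 0)
Assumption: fields are distinguished by CamelCase, where the company is the first hump, the name is the remaining alphabetic characters, and the year follows.
Input the raw line and process it with indexc to isolate the start of each significant hump. Example: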
data want;
  input;
  * presume field differentiation is only known via camel-case;
  index1 = indexc(_infile_, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
  index2 = indexc(substr(_infile_,index1+1), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') + index1;
  index3 = indexc(substr(_infile_,index2+1), '1234567890') + index2;
  length company person $50 year 8;
  company = substr(_infile_,index1,index2-index1);
  person = substr(_infile_,index2,index3-index2);
  year = input (substr(_infile_, index3), 4.);
  * or regex way;
  if _n_ = 1 then do;
    rx = prxparse ('/^([A-Z][^A-Z]*)([^0-9]*)(\d{4})\s*$/');
    retain rx;
  end;
  if prxmatch (rx, _infile_) then do;
    length company2 person2 $50 year2 8;
    company2 = prxposn(rx,1,_infile_);
    person2 = prxposn(rx,2,_infile_);
    year2 = input (prxposn(rx,3,_infile_), 4.);
  end;
datalines;
MicrosoftBillGates1976
AppleSteaveJob1975
GoogleLarryPage2004
FacebookMarkZukerberg2004
TwitterBizStone2006
;
run;
Answer 3: (score: 0)
Working with standardized, delimited files is always best practice for data processing.
Solution 1: Datalines
data output1;
  input line $40.;
  p = anyupper(line,2);            /* start of the person's name */
  d = anydigit(line);              /* start of the year */
  Company = substrn(line,1,p-1);
  name = substrn(line,p,d-p);
  f = anyupper(name,2);            /* start of the surname within name */
  Forename = substrn(name,1,f-1);
  Surname = substrn(name,f);
  Year = substrn(line,d);
  drop line p d f name;
  put Company= name= Forename= Surname= Year=;
cards;
MicrosoftBillGates1976
AppleSteaveJob1975
GoogleLarryPage2004
FacebookMarkZukerberg2004
TwitterBizStone2006
;
run;
Solution 2: External file
/* Read external file */
DATA WORK.cn;
  LENGTH F1 $ 25;
  FORMAT F1 $CHAR25.;
  INFORMAT F1 $CHAR25.;
  INFILE 'E:\saswork\cn.txt'   /* Change this to your file path */
    LRECL=25
    ENCODING="WLATIN1"
    TERMSTR=CRLF
    DLM='7F'x                  /* DEL character, not a comma: it does not occur in the data, so each line is read as one field */
    MISSOVER
    DSD;
  INPUT F1 : $CHAR25.;
RUN;
/* Parse columns and save data */
data output2;
  set work.cn;
  p = anyupper(f1,2);              /* start of the person's name */
  d = anydigit(f1);                /* start of the year */
  Company = substrn(f1,1,p-1);
  name = substrn(f1,p,d-p);
  f = anyupper(name,2);            /* start of the surname within name */
  Forename = substrn(name,1,f-1);
  Surname = substrn(name,f);
  Year = substrn(f1,d);
  drop f1 p d f name;
  put Company= name= Forename= Surname= Year=;
run;
Output:
Company=Microsoft name=BillGates Forename=Bill Surname=Gates Year=1976
Company=Apple name=SteaveJob Forename=Steave Surname=Job Year=1975
Company=Google name=LarryPage Forename=Larry Surname=Page Year=2004
Company=Facebook name=MarkZukerberg Forename=Mark Surname=Zukerberg Year=2004
Company=Twitter name=BizStone Forename=Biz Surname=Stone Year=2006
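Note that Year in both solutions is a character variable, because SUBSTRN returns a character value. If a numeric year is needed, one possible follow-up step is sketched below. The names output3 and YearNum are hypothetical and do not appear in the original answer; the sketch assumes output2 has already been created as above.

```sas
/* Sketch: turn the character Year from output2 into a numeric variable. */
/* output3 and YearNum are hypothetical names used for illustration.     */
data output3;
  set output2;
  YearNum = input(Year, 4.);      /* read the first 4 characters as a number */
  drop Year;
run;
```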