Question

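I have chat data that I want to read in one entry at a time. Each time a person hits "send" should be one observation. The problem is that the text contains line breaks (Enter), and I can't get SAS to keep reading into the same observation. Here is some dummy data: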

   08:23 - Greg: Hi!
   08:24 - Sue: Hello
   08:24 - Greg: How are you?
   08:25 - Sue: Just fine :)

   How are you then?
   08:26 - Greg: All good.

I want this to be 5 observations, but I can only get SAS to read it as 7 observations. The desired data set should look like this:

Obs   VAR1
1    08:23 - Greg: Hi!
2    08:24 - Sue: Hello
3    08:24 - Greg: How are you?
4    08:25 - Sue: Just fine :) How are you then?
5    08:26 - Greg: All good.

The code I have been playing with:

data testing;
infile datalines ;
input var1 $60. ;
datalines;
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)

How are you then?
08:26 - Greg: All good. 
;

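The actual file is a txt file, though, and has many more irregularities than the dummy example above. I tried using a trailing @ but couldn't get it to work the way I want. Maybe the trailing @ isn't what I'm after. Any suggestions on what to do?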

Answer 1

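Try this.

Keep a running variable that holds the message built so far. If the current value has a time stamp in its first 4 characters, output the running variable and reset it to "". Then append the current value to the running variable. Finally, output on the last row, no matter what.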

data testing(keep=line);
set testing end=last;

format line $2000.;
retain line;

if _n_ > 1 then do;
    if index(substr(var1,1,4),":") then do;
        output;
        line = "";
    end;
end;

put line= var1=;
line = catx(" ",line , var1);
put line=;

if last then do;
    output;
    put "AT LAST";
end;
run;

Answer 2

I tried in vain to find a solution at the raw data input stage. Anyway, I hope this works for you; it post-processes the strings:

data testing;
infile datalines ;
input var1 $60.;
datalines;
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)

How are you then?
08:26 - Greg: All good. 
;

data testing01;
set testing;
retain row 0;
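/* A line starts a new message when it begins with a valid hh:mm time stamp. */
/* The ?? modifier suppresses invalid-data notes for lines that do not start with digits. */
if 0 le input(substr(var1,1,2), ?? 8.) le 23
and substr(var1,3,1)=':'
and 0 le input(substr(var1,4,2), ?? 8.) le 59 then row = row+1;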
run;

proc transpose data=testing01 out=testing02;
var var1;
by row;
run;

data testing03;
length final $2000;
set testing02;
array str[*] col:;
do i=1 to dim(str);
if str[i] ne '' then final=catx(' ',final,str[i]);
end; 
drop col: row i _name_;
run;

Answer 3

filename FT15F001 temp;  
data testing ;
infile FT15F001 end=eof ;
length string $6323;
retain string;
input @;
if _n_=1 then string=_infile_;
else if not missing(_infile_) and anydigit(_infile_)^=1 then string=catx(' ',string,_infile_);
else if not missing(_infile_) and anydigit(_infile_)=1 then do;
   output;
   call missing(string);
   string=_infile_;
end;
if eof then output;
PARMCARDS;
08:23 - Greg: Hi!
08:24 - Sue: Hello
08:24 - Greg: How are you?
08:25 - Sue: Just fine :)

How are you then?
08:26 - Greg: All good. 
;

Answer 4

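There are many ways to do this, depending on your exact use case.

Here's one using a regular expression. It won't work if you have more than 32767 characters in total, unless you have a way to split the file into chunks, but it works well for smaller files. The general approach can also be used even if you read one line at a time.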

data test;
infile "c:\temp\chat.txt" recfm=f lrecl=32767;
input @;
rx_find = prxparse('~(\d\d:\d\d -.*?)(?=(?:\b\d\d:\d\d)|$)~ios');
rc_find = prxmatch(rx_find,_infile_);
pos=1;
pos2=0;
start=1;
call prxposn(rx_find,1,pos,len);
do until (pos2=0);
    call prxposn(rx_find,1,pos,len);
    found=substr(_infile_,pos,len);
    output;
    start=pos+len;
    call prxnext(rx_find,start,-1,_infile_,pos2,len2);
end;
stop;
run;
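
For the line-at-a-time variant mentioned above, a minimal sketch might look like the following. It is not part of the original answer: the data set name `chat`, the variable name `message`, and the 2000-character length are assumptions chosen for illustration, and the pattern assumes every message starts with `hh:mm -` as in the dummy data.

```sas
data chat(keep=message);
    infile "c:\temp\chat.txt" end=eof;   /* same example path as above */
    length message $2000;                /* assumed maximum message length */
    retain message;
    input;                               /* load the current line into _infile_ */

    /* A line that begins with hh:mm marks the start of a new message. */
    if prxmatch('/^\d\d:\d\d -/', _infile_) then do;
        if not missing(message) then output;   /* write the previous message */
        message = _infile_;
    end;
    /* Any other non-blank line continues the current message. */
    else if not missing(_infile_) then message = catx(' ', message, _infile_);

    /* Write the final message once the last line has been read. */
    if eof and not missing(message) then output;
run;
```

Because each line is tested on its own, the 32767-character limit then applies per line rather than to the whole file. Lines that appear before the first time stamp are simply folded into the first message.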

Continue reading input from the next line into the same variable
