Question

I have a data file that looks like this:

001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006

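I am finding the task of reading this into SAS next to impossible. And no, reformatting the data file is not an option in this case. So let me explain what you are looking at here: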

Each subject has two lines of data. The first line is:

subject number / clinic / wt / hr / dx / sx (don't worry about what the numbers mean; that is irrelevant).

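The second line is text: essentially a comment with additional information about the subject whose data is laid out on the preceding line. So the lines: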

001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks

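belong to a single subject, subject 001. They need to become a single row in the SAS data set. I am at a complete loss; because the clinic names differ in length and the columns are not aligned, I cannot figure out how to get SAS to read this. This is the closest I have gotten: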

data ClinData;
    infile "&wdir.clinic_data.txt";
    retain patno clinic weight hr dx sx exinfo;
    input patno clinic $1. @;
    if clinic='M' then
        input patno @5 clinic $11. weight hr dx sx / @1 exinfo $30.;
    else if clinic='H' then
        input patno @5 clinic $3. weight hr dx sx / @1 exinfo $30.;
    run;

This prints as:

http://i61.tinypic.com/2uswl90.png

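All of the numeric values are in the right place.

However, there are a few problems with this.

First, the subject number ('patno') always shows up as a missing value. Why?

Second, the clinic is represented only by its first letter, 'M' or 'H'. I cannot get SAS to vary the length of the clinic variable depending on which clinic it is.

Third, the variable 'exinfo' holds the comment about the patient. However, I cannot get SAS to include the whole line. About 30 characters is the most I can get before the formatting breaks.

Any help? The SAS documentation is frustrating for this kind of input. None of the examples really match what I need, and none adequately explain how to use some of the options. I know I need to use column/line pointers, but the problem is that the columns are not consistent from row to row. So whichever pointer format I use, some rows still come out wrong.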

Answer 1

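Nothing is impossible in SAS. Looking at your sample data, I notice that your clinic name is followed by two blanks and your patient number is always three characters. If that is always true, you can take advantage of it: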

data want;
  length patno $3 clinic $20 weight hr dx sx 8 exinfo $80;
  input;
  patno  = scan(_infile_,1,' ');
  clinic = substr(_infile_,5,index(_infile_,'  ')-5);
  weight = input(scan(_infile_,-4,' '),8.);
  hr     = input(scan(_infile_,-3,' '),8.);
  dx     = input(scan(_infile_,-2,' '),8.);
  sx     = input(scan(_infile_,-1,' '),8.);
  input exinfo $80.;

datalines;
001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006
;
run;

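Basically, this parses the automatic variable _INFILE_ to read each variable. The "hard" part is figuring out how to read the clinic name (because it contains embedded blanks). If the clinic is not always followed by a double blank, you can still do it with other SUBSTR, INDEX, and/or SCAN function calls. If that is the case, I will leave that to you.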
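One way the no-double-blank case might be handled is to find where the fourth word from the right starts and treat everything between column 5 and that position as the clinic name. This is only a sketch. It assumes the last four blank-delimited words on the line are always the numeric fields and that the clinic name always starts in column 5. The names want2, pos, and len are made up for illustration.

```sas
data want2;
  length patno $3 clinic $20 weight hr dx sx 8 exinfo $80;
  input;                                    /* load the data line into _infile_ */
  patno = scan(_infile_, 1, ' ');
  /* CALL SCAN returns the start column (pos) and length (len) of a word; */
  /* -4 asks for the fourth word counting from the right (weight).        */
  call scan(_infile_, -4, pos, len, ' ');
  /* The clinic name is whatever lies between column 5 and that word.     */
  clinic = strip(substr(_infile_, 5, pos - 5));
  weight = input(scan(_infile_, -4, ' '), ?? 8.);
  hr     = input(scan(_infile_, -3, ' '), ?? 8.);
  dx     = input(scan(_infile_, -2, ' '), ?? 8.);
  sx     = input(scan(_infile_, -1, ' '), ?? 8.);
  input exinfo $80.;                        /* second line: the comment */
  drop pos len;
datalines;
001 Mayo Clinic 120 78 7 15
Patient has had a persistent cough for 3 weeks
064 HMC 201 59 . .
Patient left against medical advice
;
run;
```

Note that the sample lines here use a single blank after the clinic name on purpose. A line with no clinic name at all would make pos - 5 non-positive, so that case would need its own check.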

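Also, when creating a new data set, always define your variables with a LENGTH statement to make sure they have the correct length, especially character variables.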
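A small illustration of why the LENGTH statement matters: without one, a character variable takes its length from the first place it is used, and later, longer values are truncated. The data set and variable names below (demo, clinic_a, clinic_b) are hypothetical.

```sas
data demo;
  /* No LENGTH statement: the first assignment fixes the length at 3. */
  clinic_a = 'HMC';
  clinic_a = 'Mayo Clinic';   /* truncated to 'May' */

  /* Declared up front: the variable has room for the longer value. */
  length clinic_b $20;
  clinic_b = 'HMC';
  clinic_b = 'Mayo Clinic';   /* stored in full */

  put clinic_a= clinic_b=;    /* write both values to the log */
run;
```

The same thing happens when the first reference is an informat such as $1. in an INPUT statement, which is what limits clinic to one character in the question's code.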

Answer 2

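You are mixing input styles in an odd way, and they will not read correctly that way.

Your clinic is 1 long because you input it as a single character, which defines it as length 1. Don't do that: use a throwaway variable if you need to, and define the clinic's length to be longer.

I suggest an approach like the following. It is easier to work with _INFILE_ (an automatic variable, created during input, that holds the current line of data) than to try to use input techniques alone. Your data is fairly simple; if it is more complex than what you have shown (for example, if you have more clinics than these), a regular expression or other logic might help further, and _INFILE_ would still be the easier thing to parse. There are also ANYDIGIT and NOTDIGIT and similar functions, plus COMPRESS, which might help.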

data want;
  length clinic $12;
  /* Hold the line so _infile_ exists and we can work with it; read patid while we are here. */
  input @1 patid 3. @;
  /* The four numeric fields are filled in through an array. */
  array numvars weight hr dx sx;
  /* Walk the line backwards: _t = 4 maps to word -1, _t = 3 to word -2, and so on.     */
  /* The blank is named as the only delimiter so a period (missing value) stays a word. */
  do _t = 4 to 1 by -1;
    numvars[_t] = input(scan(_infile_, _t - 5, ' '), ?? 8.);
  end;
  /* Start with the second word, then add the third word if it is 'Clinic'.             */
  /* You could instead check whether compress(scan(_infile_,3),,'ka') is not missing.   */
  clinic = scan(_infile_, 2);
  if scan(_infile_, 3) = 'Clinic' then clinic = catx(' ', clinic, scan(_infile_, 3));
  input;                  /* release the held line */
  input @1 exinfo $50.;   /* read the comment line */
  drop _t;
  put _all_;
datalines;
001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006
;;;;
run;
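If there turn out to be more clinic names than a simple 'Clinic' check can handle, the regular-expression route mentioned above might look roughly like this. It is a sketch rather than a tested solution. It assumes every data line is a run of digits, then the clinic name, then exactly four blank-delimited fields. The names want_rx and rx are illustrative.

```sas
data want_rx;
  length patno $3 clinic $20 weight hr dx sx 8 exinfo $80;
  retain rx;
  drop rx;
  /* Compile the pattern once: id, lazy clinic name, then four non-blank fields. */
  if _n_ = 1 then
    rx = prxparse('/^(\d+)\s+(.*?)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$/');
  input;                                    /* load the data line into _infile_ */
  if prxmatch(rx, _infile_) then do;
    patno  = prxposn(rx, 1, _infile_);
    clinic = prxposn(rx, 2, _infile_);
    /* ?? suppresses notes when a field is '.' or otherwise not a number */
    weight = input(prxposn(rx, 3, _infile_), ?? 8.);
    hr     = input(prxposn(rx, 4, _infile_), ?? 8.);
    dx     = input(prxposn(rx, 5, _infile_), ?? 8.);
    sx     = input(prxposn(rx, 6, _infile_), ?? 8.);
  end;
  input exinfo $80.;                        /* second line: the comment */
datalines;
001 Mayo Clinic  120 78 7 15
Patient has had a persistent cough for 3 weeks
064 HMC  201 59 . .
Patient left against medical advice
;
run;
```

Because the clinic group is lazy and the pattern is anchored at the end of the line, the clinic name ends up being everything before the last four fields, however many words it contains.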

Answer 3

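Most of the problems you are running into come from the lengths you have explicitly declared. For example, clinic is defined as $1. in the initial INPUT statement, and its length cannot be changed after the fact when you try to in the second INPUT statement.

This may get you closer to what you want: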

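data ClinData(drop=s varlen);
  retain patno clinic weight hr dx sx;

  input patno clinic $30. @;            /* read 30 characters from column 5 and hold the line */
  clinic = compress(clinic, , 'kas');   /* keep letters and spaces ('ka' alone would also strip the blank in 'Mayo Clinic') */
  s = length(clinic) + 4 + 2;           /* a column just past the clinic name */
  input @s weight hr dx sx / @;         /* read the numbers, then move to the comment line and hold it */
  varlen = length(_infile_);            /* length of the comment line */
  input @1 exinfo $varying256. varlen;  /* read the whole comment, whatever its length */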

datalines4;
001 Mayo Clinic  120 78 7 15 
Patient has had a persistent cough for 3 weeks
023 Mayo Clinic  157 72 10 2 
Patient complained of ear ache
064 HMC  201 59 . . 
Patient left against medical advice
003 HMC  166 58 8 15 
Patient placed on beta-blockers on 7/1/2006
;;;;
run; 
proc print data=ClinData; run;

Reading data into SAS with misaligned columns
