SAS - Remove duplicate words from the strings in an entire variable

Date: 2017-11-12 02:22:49

Tags: sas

I have a string variable named cat_d in which every observation contains repeated words. How can I remove the duplicate words within each observation?

Sample data:

MPSJ,Hulu Langat,Hulu Langat,MPAJ,MPSJ,MPAJ,Gombak,MPSJ,MPSJ,MPSJ,MPKJ,MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak

Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama

Kuantan

MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB

Expected output:

MPSJ,Hulu Langat,MPAJ,Gombak,MPKJ

Cheras,Setapak,Pusat Bandar,Klang Lama

Kuantan

MPJBT,MBJB,MPPG

    data keep;
        i=2;
        length word $500;
        do until (last.cat_d);
            set want;
            by cat_d notsorted;
            string=cat_d;
            do while(scan(string, i, ',') ^= '');
                word = scan(string, i, ',');
                do j = 1 to i - 1;
                    if word = scan(string, j, ',') then do;
                        start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
                        string = cat(substr(string, 1, start - 2), substr(string, start + length(word)));
                        leave;
                    end;
                end;
                i = i + 1;
            end;
        end;
        keep cat_d string;
    run;

1 answer:

Answer 0 (score: 2)

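If you want the approach above to work, try using TRANWRD to remove the words, but you will also have to handle the commas and make sure they are removed where necessary. What happens with the last word, which has no comma after it?

Here is a completely different approach, but in my opinion it is more flexible.

  1. Count the number of words in each observation of the variable.
  2. Split the string so that each entry is on its own row. In general, you may find this structure easier to work with overall.
  3. Sort and de-duplicate the dataset.
  4. Transpose it back to a wide dataset and recreate the sentence.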

    *Create sample data;
    
    data have;
        length x $200.;
        x="MPSJ,Hulu Langat,Hulu Langat, MPAJ, MPSJ, MPAJ, Gombak, MPSJ, MPSJ, MPSJ, MPKJ, MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak";
        output;
        x="Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama";
        output;
        x="Kuantan";
        output;
        x="MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB";
        output;
    run;
    
    *Make it into a long dataset;
    
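    data long;
        set have;
        *Use the comma as the only delimiter so multi-word names such as Hulu Langat stay intact;
        nwords=countw(x, ',');
        ID=_n_;
    
        do i=1 to nwords;
            *STRIP removes the blanks that follow some of the commas in the sample data;
            words=strip(scan(x, i, ','));
            output;
        end;
    run;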
    
    *Sort and remove duplicate values (note: this orders the words alphabetically within each ID);
    
    proc sort data=long nodupkey out=long_unique;
        by ID words;
    run;
    
    *Transpose to a wide format;
    
    proc transpose data=long_unique out=wide_unique prefix=word;
        by id;
        var words;
    run;
    
    *Make it back into one variable;
    
    data want;
        set wide_unique;
        by id;
        sentence=catx(", ", of word:);
    run;
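
If the order of first appearance matters, as it does in the expected output in the question, the sort-based steps above will not preserve it. Below is a minimal single-step sketch of an alternative, not a tested drop-in: it assumes the `have` dataset and variable `x` from the sample above, a comma as the only delimiter, and results that fit in 200 characters. The names `want_ordered`, `dedup`, and `word` are made up for illustration.

```sas
/* Sketch only: want_ordered, dedup and word are hypothetical names. */
data want_ordered;
    set have;
    length dedup $200 word $200;
    dedup = '';
    do i = 1 to countw(x, ',');
        /* Take the i-th comma-separated entry and drop surrounding blanks */
        word = strip(scan(x, i, ','));
        /* FINDW returns 0 when word is not yet a whole entry in dedup;
           the T modifier trims trailing blanks from the arguments */
        if findw(dedup, word, ',', 't') = 0 then
            dedup = catx(',', dedup, word);
    end;
    drop i word;
run;
```

Because each entry is appended only the first time it is seen, `dedup` should keep the original left-to-right order, and no sort or transpose is needed. The trade-off is that the long format from step 2, which is often easier to work with for other tasks, is never created.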