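I have a string variable named cat_d in which every observation contains repeated words. How can I remove the duplicate words within each observation? The linked image shows the variable and its data: variable cat_d
Sample data, one observation per line: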
MPSJ,Hulu Langat,Hulu Langat,MPAJ,MPSJ,MPAJ,Gombak,MPSJ,MPSJ,MPSJ,MPKJ,MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak
Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama
Kuantan
MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB
Expected output:
MPSJ,Hulu Langat,MPAJ,Gombak,MPKJ
Cheras,Setapak,Pusat Bandar,Klang Lama
Kuantan
MPJBT,MBJB,MPPG
data keep;
    i=2;
    length word $500;
    do until (last.cat_d);
        set want;
        by cat_d notsorted;
        string=cat_d;
        do while(scan(string, i, ',') ^= '');
            word = scan(string, i, ',');
            do j = 1 to i - 1;
                if word = scan(string, j, ',') then do;
                    start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
                    string = cat(substr(string, 1, start - 2), substr(string, start + length(word)));
                    leave;
                end;
            end;
            i = i + 1;
        end;
    end;
    keep cat_d string;
run;
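Answer 0 (score: 2)
If you want the approach above to work, you should try using TRANWRD to remove the words, but you will also have to deal with the commas and make sure they are removed where necessary. And what happens to the last word, which has no comma after it?
Here is a completely different approach, but in my opinion it is more flexible.
Split each string into a long dataset, remove the duplicates, then transpose it back to a wide dataset and recreate the sentence.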
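One way to avoid the comma bookkeeping altogether is to build a new string rather than delete from the existing one. The following is only a minimal sketch of that alternative, not the TRANWRD approach itself. It assumes the input dataset is called `have` and holds `cat_d`; the names `have`, `deduped`, and `word` are illustrative and not from the question.

```sas
data want;
    set have;                      /* assumed input dataset containing cat_d */
    length deduped $500 word $200; /* illustrative names and lengths */
    deduped = '';
    do i = 1 to countw(cat_d, ',');
        /* split on commas only, so "Hulu Langat" stays one word */
        word = strip(scan(cat_d, i, ','));
        /* append the word only if it is not already in the result */
        if findw(deduped, strip(word), ',', 't') = 0 then
            deduped = catx(',', deduped, word);
    end;
    drop i word;
run;
```

Because CATX inserts the delimiter only between non-blank pieces, there is no leading or trailing comma to clean up, and the first occurrence of each word keeps its original position.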
*Create sample data;
data have;
    length x $200;
    x="MPSJ,Hulu Langat,Hulu Langat, MPAJ, MPSJ, MPAJ, Gombak, MPSJ, MPSJ, MPSJ, MPKJ, MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak";
    output;
    x="Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama";
    output;
    x="Kuantan";
    output;
    x="MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB";
    output;
run;
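*Make it into a long dataset, splitting on commas only;
*(the default delimiters include blanks and would break up "Hulu Langat");
data long;
    set have;
    length words $200;
    nwords=countw(x, ',');
    ID=_n_;
    do i=1 to nwords;
        *STRIP removes the blank that follows some commas in the sample data;
        words=strip(scan(x, i, ','));
        output;
    end;
run;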
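*Sort and remove duplicate values;
proc sort data=long nodupkey out=long_unique;
    by ID words;
run;
*Sort again by position so the words keep the order in which they appeared;
proc sort data=long_unique;
    by ID i;
run;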
*Transpose to a wide format;
proc transpose data=long_unique out=wide_unique prefix=word;
    by id;
    var words;
run;
*Make it back into one variable;
data want;
    set wide_unique;
    by id;
    sentence=catx(", ", of word:);
run;