我想看看这在SAS中是否可行。我有国会议员的数据集,并希望将全名分为第一名和最后一名。但是,有时他们似乎会列出他们的中间名缩写或名字。它来自.txt文件。
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
答案 0 :(得分:0)
美好的一天
在字符串方面,SAS有点笨拙。但是可以做到的。正如其他人提到的,这是定义的逻辑,这是真正困难的部分。
从一些原始数据开始...
data begin;
input raw_str $ 1-100;
cards;
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
; run;
首先,我选择领先的名字,直到第一个括号。
计算字符串数
data names;
set begin;
names_only = scan(raw_str,1,'[');
Nr_of_str = countw(names_only,' ');
run;
假设:姓氏是姓氏。
如果只有2个字符串,那么使用scan和substring很容易做到第一个和最后一个:
data names2;
set names;
if Nr_of_str = 2 then do;
last_name = scan(names_only, 1, ' ');
_FirstBlank = find(names_only, ' ');
first_name = strip(substr(names_only, _FirstBlank));
end;
run;
假设:只有3个字符串。 方法1.中间名中带有点。过滤掉。 方法2.中间名短于真实名:
data names3;
set names2;
if Nr_of_str > 2 then do;
last_name = scan(names_only, 1, ' '); /*this should still hold*/
_FirstBlank = find(names_only, ' '); /*Substring approach */
first_name = strip(substr(names_only, _FirstBlank));
second_str = scan(names_only, 2, ' ');
third_str = scan(names_only, 3, ' ');
if find(second_str,'.') = 0 then /*1st approch */
first_name = scan(names_only, 2, ' ');
else
first_name = scan(names_only, 3, ' ');
if len(second_str) > len(second_str) then /*2nd approch */
first_name = second_str;
else
first_name = third_str;
end;
run;