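I'm trying to remove words from a SAS field using a word list.
I've been able to isolate each word using some code I found online, but I can't remove the word from the field.
For example, if the field is:
"the fox jumped over the moon"
and the word "jumped" is in the word list, the result should be something like:
"the fox over the moon"
Here is the table of stop words to remove: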
PROC SQL;
CREATE TABLE BOW.QUERY_FOR_STOPWORDS AS
SELECT t1.StopWords
FROM BOW.STOPWORDS t1;
QUIT;
Here is the table with the field to clean:
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_ANNU_COMMENTS AS
SELECT t1.Comment
FROM BOW.ANNU_COMMENTS t1;
QUIT;
Answer 0 (score: 1)
Depending on how many words you have, there may be other solutions.
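data _NULL_;
  set STOPWORDS end=e;
  /* first stop word: open the generated data step */
  if _N_=1 then call execute('data result;set ANNU_COMMENTS;newComment=Comment;');
  /* the sum statement (var+expr) implicitly retains the compiled regex id */
  call execute('if _N_=1 then __'||put(_N_,z30.)||'+prxparse("s/'||trimn(StopWords)||'//");');
  call execute('call prxchange(__'||put(_N_,z30.)||',-1,newComment);');
  /* last stop word: drop the regex ids and close the generated step */
  if e then call execute('drop __:;run;');
run;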
This reads the stop words and generates a data step from them; the generated data step then processes the comments.
EDIT: To remove only whole words, use the word-boundary anchor \b in the regex.
data _NULL_;
  set STOPWORDS end=e;
  if _N_=1 then call execute('data result;set ANNU_COMMENTS;newComment=Comment;');
  /* \b on both sides restricts the match to whole words */
  call execute('if _N_=1 then __'||put(_N_,z30.)||'+prxparse("s/\b'||trimn(StopWords)||'\b//");');
  call execute('call prxchange(__'||put(_N_,z30.)||',-1,newComment);');
  if e then call execute('drop __:;run;');
run;
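To make the code generation concrete, here is a sketch of the data step that CALL EXECUTE would queue up, assuming STOPWORDS holds just the two words "jumped" and "over" (hypothetical sample values). The queued statements run after the data _NULL_ step finishes.

```sas
/* Sketch of the generated step for two assumed stop words: jumped, over */
data result;
  set ANNU_COMMENTS;
  newComment=Comment;
  /* one retained regex id per stop word, compiled on the first row only */
  if _N_=1 then __000000000000000000000000000001+prxparse("s/\bjumped\b//");
  call prxchange(__000000000000000000000000000001,-1,newComment);
  if _N_=1 then __000000000000000000000000000002+prxparse("s/\bover\b//");
  call prxchange(__000000000000000000000000000002,-1,newComment);
  drop __:;
run;
```

Note that each stop word is pasted into the pattern as-is, so a word containing regex metacharacters (such as . or +) would be interpreted as a pattern rather than literal text.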
Answer 1 (score: 0)
The basic idea is replace():
SELECT REPLACE(t1.Comment, 'jumped', '')
FROM BOW.ANNU_COMMENTS t1;
However, you may run into problems with spaces. If that's an issue and you want whole words only, then this might work:
SELECT TRIM(BOTH ' ' FROM REPLACE(' ' || t1.Comment || ' ', ' jumped ', ''))
FROM BOW.ANNU_COMMENTS t1;
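REPLACE and TRIM(BOTH ... FROM ...) are generic SQL; as far as I know, PROC SQL does not provide them, and TRANWRD and STRIP play the same roles there. A minimal sketch of the same idea in PROC SQL follows, where the output table name WORK.COMMENTS_NO_JUMPED and the column name newComment are assumptions made for illustration:

```sas
proc sql;
  /* output table and newComment are hypothetical names */
  create table WORK.COMMENTS_NO_JUMPED as
  select t1.Comment,
         /* pad with blanks so ' jumped ' only matches the whole word,
            then replace it with a single blank and trim the padding */
         strip(tranwrd(' ' || strip(t1.Comment) || ' ', ' jumped ', ' ')) as newComment
  from BOW.ANNU_COMMENTS t1;
quit;
```

Replacing with a single blank rather than an empty string keeps the neighbouring words separated.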
Answer 2 (score: 0)
You could try a hash iterator, for example:
data want;
  if 0 then set STOPWORDS;
  if _n_=1 then do;
    /* load the stop words into a hash and attach an iterator */
    declare hash h(dataset:'STOPWORDS');
    declare hiter iter('h');
    h.definekey('StopWords');
    h.definedata('StopWords');
    h.definedone();
  end;
  set ANNU_COMMENTS;
  /* start from the original text; without this newComment would be blank */
  newComment=Comment;
  rc=iter.first();
  do while(rc=0);
    newComment=ifc(findw(newComment,strip(StopWords))>0,tranwrd(newComment,strip(StopWords),''),newComment);
    rc=iter.next();
  end;
  drop StopWords rc;
run;
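TRANWRD substitutes a single blank when the replacement string is empty, so each removed word can leave a run of blanks behind. If that matters, a follow-up step along these lines might tidy the result; the data set name want_clean is an assumption made for illustration:

```sas
data want_clean;   /* hypothetical output name */
  set want;
  /* STRIP drops leading/trailing blanks, COMPBL collapses runs of blanks to one */
  newComment = compbl(strip(newComment));
run;
```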
Answer 3 (score: 0)
A macro can be written to generate the cleaning data step.
This example mixes Shenglin's use of tranwrd with Lee's code generation.
/* out= is an added parameter (name and default are assumed); the original
   body referenced an undefined &out and read from &censorData, not &censor */
%macro flense (
  data=Comments,
  var=Comment,
  newvar=CommentFlensed,
  out=CommentsFlensed,
  censor=BOW.StopWordsList,
  term=StopWords
);
  %local i quoted_word;
  proc sql noprint;
    /* single quote the words to prevent possible macro evaluation later */
    select quote(trim(&term),"'") into :sq_word1-
    from &censor;
  quit;
  data &out;
    set &data;
    &newvar = &var;
    %* for each censored word, generate an if statement
     * that checks if the word (or term) is present, and if so
     * removes the word from the new variable;
    %do i = 1 %to &SQLOBS;
      %let quoted_word = &&sq_word&i;
      if (indexw(&newvar, &quoted_word)) then
        &newvar = tranwrd(&newvar, &quoted_word, '');
    %end;
  run;
%mend;
%flense();