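I am trying to find similar phrases. My dataset has two phrase columns, and each phrase is at most 5-6 words long. I have tried fuzzy matching with COMPLEV, COMPGED, and so on. Because these are mostly string-distance measures, they sometimes miss matches that are obvious when I simply read the phrases. The phrases have no spelling errors, but words are sometimes shortened (for example, "Replacement" becomes "Replace") and sometimes reordered (for example, "Electric Component keyboard replacement" versus "Keyboard inward component replace"). Here is an example: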
DATA COMPONENT;
infile datalines delimiter=',';
length FIRST $ 1000 FIRST_B $ 1000;
INPUT FIRST $ FIRST_B $;
DATALINES;
Electric Component keyboard replacement, Keyboard inward component replace
Electric Component keyboard replacement, Monitor Component Replacement
Electric Component keyboard replacement, Mouse component
Electric Component keyboard replacement, Wire Replacement
Electric Component keyboard replacement, PIN part
;
DATA Compged;
SET COMPONENT;
CALL COMPCOST('SWAP=', 5, 'P=', 0, 'INS=', 10,'DEL=',10,'APPEND=',5);
First_COMPGED=COMPGED(FIRST, FIRST_B, 'iln');
RUN;
PROC SORT DATA= Compged;
BY First_COMPGED;
RUN;
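Since this score alone does not find the matches, I would like to add a second factor: how many words the two phrases have in common. The idea is to split each phrase into words, compare them, count the shared words, and add that count as an additional factor.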
/* Tried this approach*/
proc iml;
s = "Introduction,to SAS/IML... programming!";
delims = ' ,.!';
n = countw(s, delims);
words = scan(s, 1:n, delims); /* pass parameter vector: create vector of words */
print words;
quit;
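I am not sure how to apply this to my current table so that I get words and words_b from the phrases first and first_b. Is there another way to achieve this for the example above?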
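The word comparison described above could perhaps be done directly in a DATA step, without PROC IML. The following is only a sketch. It assumes that words are separated by blanks only and that matching should ignore case. The data set name Compged_words and the variables COMMON_WORDS, n_first, n_first_b and word are names made up for this illustration.

```sas
/* Sketch only: count the words of FIRST that also occur in FIRST_B.   */
/* Assumptions: blank-delimited words, case-insensitive exact matches. */
/* Compged_words, COMMON_WORDS, n_first, n_first_b are made-up names.  */
DATA Compged_words;
  SET Compged;
  LENGTH word $ 1000;
  n_first   = COUNTW(FIRST, ' ');    /* number of words in FIRST   */
  n_first_b = COUNTW(FIRST_B, ' ');  /* number of words in FIRST_B */
  COMMON_WORDS = 0;
  DO i = 1 TO n_first;
    word = SCAN(FIRST, i, ' ');      /* i-th word of FIRST */
    /* FINDW returns 0 when the word is absent; 'i' ignores case */
    IF FINDW(FIRST_B, STRIP(word), ' ', 'i') > 0 THEN
      COMMON_WORDS = COMMON_WORDS + 1;
  END;
  DROP i word;
RUN;

/* Most shared words first, then the smallest COMPGED score */
PROC SORT DATA=Compged_words;
  BY DESCENDING COMMON_WORDS First_COMPGED;
RUN;
```

With this approach, the first example row should count Component and keyboard as shared words. It does not count replacement against replace, because only whole words are compared. A word that is repeated in FIRST is counted each time it appears. Shortened forms might be handled by comparing word pairs with COMPGED or COMPLEV instead of FINDW, but that is not shown here.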