Question

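I'm trying to find a function that returns the index of the nth instance of a character (or group of characters) in a string.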

For example, if I have the string ABABABBABSSSDDEE and I want to find the 3rd instance of A, how do I do that? What if I want to find the 4th instance of AB?

ABAB[A]BB[AB]SSSDDEE   (brackets mark the 3rd A and the 4th AB)

data HAVE;
   length STRING $50;   /* without this, STRING defaults to 8 characters and is truncated */
   input STRING $;
   datalines;
ABABABBABSSSDDEE
;
RUN;

Answer 1

data _null_;
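findThis = 'A';             /* substring to find (used as a regex, so metacharacters are not escaped) */
findIn = 'ADABAACABAAE';    /* the string to search */
instanceOf = 1;             /* which occurrence of the substring to find */
pos = 0;                    /* position of the current match (0 = no match) */
len = 0;                    /* length of the current match */
startHere = 1;              /* where PRXNEXT starts searching; updated after each call */
endAt = length(findIn);     /* last position to search */
n = 0;                      /* number of occurrences found so far */
pattern = '/' || findThis || '/';
rx = prxparse(pattern);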
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
if pos le 0 then do;
    put 'Could not find ' findThis ' in ' findIn;
end;
else do while (pos gt 0);
    n+1;
    if n eq instanceOf then leave;
    CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
end;
if n eq instanceOf then do;
    put 'found ' instanceOf 'th instance of ' findThis ' at position ' pos ' in ' findIn;
end;
else do;
    put 'No ' instanceOf 'th instance of ' findThis ' found';
end;
run;

Answer 2

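Here's a solution that uses the find() function and a do loop in a data step. I then take that code and put it into proc fcmp to create my own function, find_n(). This should greatly simplify any task that uses it and allows the code to be reused.

Define the data: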

data have;  
  length string $50;
  input string $;
  datalines;
ABABABBABSSSDDEE
;
run;

Do-loop solution:

data want;
  set have;  
  search_term = 'AB';
  nth_time = 4;
  counter = 0;
  last_find = 0;

  start = 1;
  pos = find(string,search_term,'',start);
  do while (pos gt 0 and nth_time gt counter);
    last_find = pos;
    start = pos + 1;
    counter = counter + 1;
    /* start already points just past the previous match; start+1 would skip a position */
    pos = find(string,search_term,'',start);
  end;

  if nth_time eq counter then do;    
    put "The nth occurrence was found at position " last_find;
  end;
  else do;
    put "Could not find the nth occurrence";
  end;

run;

Define the proc fcmp function:

Note: it returns 0 if the nth occurrence cannot be found.

options cmplib=work.temp;  /* CMPLIB= takes libref.dataset; the package name is only used in OUTLIB= */

proc fcmp outlib=work.temp.temp;

  function find_n(string $, search_term $, nth_time) ;    

    counter = 0;
    last_find = 0;

    start = 1;
    pos = find(string,search_term,'',start);
    do while (pos gt 0 and nth_time gt counter);
      last_find = pos;
      start = pos + 1;
      counter = counter + 1;
      /* start already points just past the previous match; start+1 would skip a position */
      pos = find(string,search_term,'',start);
    end;

    result = ifn(nth_time eq counter, last_find, 0);

    return (result);
  endsub;

run;

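Example proc fcmp usage:

Note that this calls the function twice. The first call shows the solution to the original request. The second shows what happens when no match is found.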

data want;
  set have;  
  nth_position = find_n(string, "AB", 4);
  put nth_position =;

  nth_position = find_n(string, "AB", 5);
  put nth_position =;
run;

Answer 3

I know I'm late to the party here, but to add to the range of answers, here's what I came up with.

DATA test;
   input   = "ABABABBABSSSDDEE";

   A_3  = find(prxchange("s/A/#/",   2, input), "A");
   AB_4 = find(prxchange("s/AB/##/", 3, input), "AB");
RUN;

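In short, prxchange() just does a pattern-matching replacement, but the nice thing is that you can tell it how many times to replace the pattern. So prxchange("s/A/#/", 2, input) replaces the first two A's in input with #. Once the first two A's have been replaced, you can wrap the result in find() to locate the "first A", which is actually the third A of the original string.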
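To make the two steps visible, here is a small sketch that stores the intermediate string before searching it. The variable name masked is only illustrative and is not part of the original answer.

```sas
data _null_;
   input  = "ABABABBABSSSDDEE";
   masked = prxchange("s/A/#/", 2, input);  /* first two A's become # */
   A_3    = find(masked, "A");              /* first remaining A is the original 3rd A */
   put masked= A_3=;
run;
```

The log should show masked=#B#BABBABSSSDDEE and A_3=5.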

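One caveat with this method is that the replacement string should have the same length as the string you are replacing. For example, note that

find(prxchange("s/AB/##/", 3, input), "AB") /* gives 8 (correct) */

and

find(prxchange("s/AB/#/", 3, input), "AB")  /* gives 5 (incorrect) */

That's because we replaced a string of length 2 with a string of length 1 three times. In other words:

(length("#") - length("AB")) * 3 = -3

so 8 + (-3) = 5.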
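One way to avoid the length mismatch is to build the replacement from the search string itself. The following is only a sketch under a few assumptions: the names sub, n and mask are hypothetical, sub contains no regex metacharacters and no # character, and n = 1 is handled separately because no replacement is needed in that case.

```sas
data _null_;
   s   = 'ABABABBABSSSDDEE';
   sub = 'AB';
   n   = 4;

   /* repeat(x, k) returns k+1 copies, so this yields one # per character of sub */
   mask = repeat('#', length(sub) - 1);

   if n = 1 then p = find(s, sub);
   else p = find(prxchange(cats('s/', sub, '/', mask, '/'), n - 1, s), sub);

   put p=;
run;
```

For the sample values this should report p=8, the same result as the hand-written "s/AB/##/" version.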

Hope this helps someone!

Answer 4

Here is a simplified implementation that uses the SAS find() function to locate the Nth instance of a group of characters in a SAS string:

     data a;
        s='AB bhdf +BA s Ab fs ABC Nfm AB ';
        x='AB';
        n=3;

        /* from left to right */
        p = 0;
        do i=1 to n until(p=0); 
           p = find(s, x, p+1);
        end;
        put p=;

        /* from right to left */
        p = length(s) + 1;
        do i=1 to n until(p=0); 
           p = find(s, x, -p+1);
        end;
        put p=;
     run;

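As you can see, it supports searching both from left to right and from right to left.

You can combine the two into a SAS user-defined function (a negative n means searching from right to left, just like a negative start position in the find function):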

     proc fcmp outlib=sasuser.functions.findnth;
        function findnth(str $, sub $, n);
           p = ifn(n>=0,0,length(str)+1);
           do i=1 to abs(n) until(p=0);
              p = find(str,sub,sign(n)*p+1);
           end;
           return (p);
        endsub;
     run;
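To call the function from a data step, SAS needs to know where the compiled function is stored. A minimal sketch, assuming the function was compiled into sasuser.functions as above and that the CMPLIB= option is pointed at that data set (the variable names are only illustrative):

```sas
options cmplib=sasuser.functions;  /* libref.dataset that holds the compiled function */

data _null_;
   s = 'AB bhdf +BA s Ab fs ABC Nfm AB ';
   from_left  = findnth(s, 'AB',  3);   /* 3rd 'AB' counting from the left  */
   from_right = findnth(s, 'AB', -3);   /* 3rd 'AB' counting from the right */
   none       = findnth(s, 'AB',  9);   /* fewer than 9 instances, so 0     */
   put from_left= from_right= none=;
run;
```

The sample string contains 'AB' at positions 1, 21 and 29 (the search is case-sensitive, so 'Ab' does not count), so from_left should be 29 and from_right should be 1.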

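Note that the solutions above, using FIND() and the FINDNTH() function, assume that an instance of the searched substring may overlap the previous instance. For example, if we search the string "ABAAAA" for the substring "AAA", the first instance is found at position 3 and the second at position 4; that is, the first and second instances overlap. That is why, after finding an instance, we advance the position p by 1 (p + 1) to start the next search iteration.

If such overlap is not valid for your search and you want to continue searching after the end of the previous instance, advance p by the length of x instead of by 1. This also speeds up the search (the longer the substring x, the greater the gain), because more characters are skipped while traversing the string s. In that case, replace p + 1 in the search code with p + w, where w = length(x). For a left-to-right search, the first search must still start at position 1, so p should then be initialized to 1 - w rather than 0.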
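The non-overlapping variant can be sketched as follows for the left-to-right case. The sample string and the variable w are assumptions chosen for illustration; p starts at 1 - w so that the first search begins at position 1.

```sas
data _null_;
   s = 'ABAAAAAA';
   x = 'AAA';
   n = 2;

   w = length(x);
   p = 1 - w;                      /* p + w = 1 on the first iteration */
   do i = 1 to n until (p = 0);
      p = find(s, x, p + w);       /* resume after the end of the previous match */
   end;
   put p=;
run;
```

With overlapping matches allowed (p + 1), the second 'AAA' in this string would be reported at position 4; skipping by w reports it at position 6.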

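I recently discussed this problem in detail in the SAS blog post "Finding n-th instance of a substring within a string". I also found that using the find() function is much faster than using the regular expression functions in SAS.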

SAS: How do I find the nth instance of a character or group of characters in a string?
