Question

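I need advice from SAS gurus :) Suppose I have two large data sets. The first is huge (about 50-100 GB!) and contains phone numbers. The second contains prefixes (200,000-400,000 observations). For each phone number in the first table, I need to add the most appropriate prefix.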

For example, if I have the phone number +71230000 and the prefixes

+7
+71230
+7123

then the most appropriate prefix is +71230 (the longest one that matches).

My idea: first, sort the prefix table. Then, in a data step, process the phone number table:

data OutputTable;
    set PhoneNumbersTable end=_last;
    if _N_ = 1 then do;
        dsid = open('PrefixTable');
    end;
    /* For each observation in PhoneNumbersTable:
       1. Take the first digit of the phone number (`+7`).
          Look it up in PrefixTable and store the observation number
          of this prefix (`n_obs`).
       2. Take the first TWO digits of the phone number (`+71`).
          Look it up in PrefixTable, starting at observation `n_obs + 1`.
          Stop either when this prefix is found
          (then store its observation number) or
          when the first digit changes (then the previous prefix was
          the most appropriate one).
       etc....
    */
    if _last then do;
        rc = close(dsid);
    end;
run;

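I hope my idea is clear enough; if not, I apologize.

So what would you suggest? Thank you for your help.

P.S. Of course, the phone numbers in the first table are not unique (they may repeat), and unfortunately my algorithm does not take advantage of that.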

Answer 1

There are two ways to do this: you can use a format or a hash table.

Example using a format:

/* Build a simple format of all prefixes, and determine max prefix length */
data prefix_fmt ;
  set prefixtable end=eof ;
  retain fmtname 'PREFIX' type 'C' maxlen . ;
  maxlen = max(maxlen,length(prefix)) ; /* Store maximum prefix length */
  start = prefix ;
  label = 'Y' ;
  output ;
  if eof then do ;
    hlo = 'O' ;
    label = 'N' ;
    output ;

    call symputx('MAXPL',maxlen) ;
  end ;

  drop maxlen ;
run ;
proc format cntlin=prefix_fmt ; run ; 

/* For each phone number, start with full number and reduce by 1 digit until prefix match found */
/* For efficiency, initially reduce phone number to length of max prefix */
data match_prefix ;
  set phonenumberstable ;

  length prefix $&MAXPL.. ;

  prefix = '' ;
  pnum = substr(phonenumber,1,&MAXPL) ;

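  /* Try the longest candidate first and stop at the first (longest) match. */
  /* Counting down to length 1 also tests single-character prefixes.        */
  do _len = length(pnum) to 1 by -1 while (missing(prefix)) ;
    if put(substr(pnum,1,_len),$PREFIX.) = 'Y' then prefix = substr(pnum,1,_len) ;
  end ;
  drop pnum _len ;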
run ;
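
To try the format approach on the example from the question, a small test setup along these lines could be created before running the two steps above. The table and variable names follow the code above; the lengths $6 and $12 and the single sample phone number are assumptions chosen only for illustration.

```sas
/* Sample prefixes from the question (length $6 is an assumed width) */
data prefixtable ;
  length prefix $6 ;
  input prefix $ ;
  datalines ;
+7
+71230
+7123
;
run ;

/* One sample phone number (length $12 is an assumed width) */
data phonenumberstable ;
  length phonenumber $12 ;
  input phonenumber $ ;
  datalines ;
+71230000
;
run ;
```

With this input the longest prefix here is 6 characters, so the first candidate tested is '+71230', which is in the format and is therefore the prefix assigned.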

Answer 2

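Here is another solution that is very fast, as long as you can work within one major (but probably acceptable) limitation: phone numbers cannot start with a 0, and they must be numeric or convertible to numeric (i.e., the "+" is not needed for the lookup).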

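What I am doing is building an array of 1/missing flags, one flag for each possible prefix. The catch is that leading zeros cannot be used, because '9512' and '09512' are the same number. This could be worked around, for example by adding a '1' at the start (so if you have possible 6-digit prefixes, everything becomes 1000000 + prefix), but that would require adjusting the code below (and might have a performance impact, though I don't think it would be that bad). If the "+" is also needed, it would have to be converted to a digit as well; here you could represent the "+" by adding 2000000 at the start, or something similar.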
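
The leading-zero workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answer's tested code: `prefixes_char`, `prefix_c`, `flags` and `matched` are hypothetical names, the prefix is assumed to be stored as a digits-only character variable of at most 6 characters, and `phone_no` is assumed to hold digits only (any "+" already stripped).

```sas
/* Sketch only: prefixes_char / prefix_c are hypothetical names for a
   prefix table whose prefix is a digits-only character variable */
data prefix_match_lz;
  /* Keys run from 10 (prefix '0') up to 1999999 (prefix '999999') */
  array flags[1999999] _temporary_;
  length matched $6;
  if _n_ = 1 then do;
    do _i = 1 to nobs_prefix;
      set prefixes_char point=_i nobs=nobs_prefix;
      /* Prepending '1' keeps leading zeros apart: '09512' -> 109512, '9512' -> 19512 */
      _idx = input(cats('1', prefix_c), ?? 7.);
      if 10 <= _idx <= 1999999 then flags[_idx] = 1;
    end;
  end;
  set phone_numbers;
  call missing(matched);
  /* Longest candidate first; stop at the first hit */
  do _j = min(6, length(phone_no)) to 1 by -1 while (missing(matched));
    _idx = input(cats('1', substr(phone_no, 1, _j)), ?? 7.);
    if 10 <= _idx <= 1999999 then
      if flags[_idx] = 1 then matched = substr(phone_no, 1, _j);
  end;
  drop _: prefix_c;
run;
```

Because the matched prefix is kept as text in `matched`, the leading zeros survive in the output; the price is an array roughly twice as large and one extra string concatenation per lookup.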

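The good news is that this takes at most 6 lookups (or so) per row, which is much faster than any other search option: a temporary array is a contiguous block of memory, so it is just "go check 6 precomputed memory addresses". Hash and format will be a fair bit slower, since they have to do a fresh lookup each time.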

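One major performance suggestion: pay attention to how your prefixes are likely to fail to match. Checking 6 then 5 then 4 then ... may be faster, or checking 1 then 2 then 3 then ... may be faster. It all depends on the actual prefixes and the actual phone numbers. If most of your prefixes are things like "+11", you almost certainly want to start from the left, so that a number starting with "94" is found not to match very quickly.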

With that said, the solution:

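data prefix_match;
  if _n_=1 then do;
    /* Flag array: prefixes[p] = 1 when the number p is a known prefix */
    array prefixes[1000000] _temporary_;
    /* Load every prefix once, on the first iteration only */
    do _i = 1 to nobs_prefix;
      set prefixes point=_i nobs=nobs_prefix;
      prefixes[prefix]=1;
    end;
    call missing(prefix);
  end;
  set phone_numbers;
  /* Try the longest candidate (6 digits) first, then shorter ones */
  do _j = 6 to 1 by -1;
    prefix = input(substr(phone_no,1,_j), ?? 6.);
    /* Nested IFs so the array is never indexed with 0 or a missing value */
    if prefix > 0 then if prefixes[prefix]=1 then leave;
    prefix=.;
  end;
  drop _:;
run;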

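With a test setup of 40k prefixes and 100m phone numbers (and no other variables), this ran in a little over 1 minute on my (good) laptop, versus 6 minutes and change with the format solution and 4 minutes and change with the hash solution (modified to output all rows, as the other two solutions do). That seems about right to me.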

Answer 3

Here is a hash table example.

Generate some dummy data.

data phone_numbers(keep=phone)
     prefixes(keep=prefix);

  length phone $10 prefix $4;
  do i=1 to 10000000;
    phone = cats(int(ranuni(0) * 9999999999 + 1));
    len = int(ranuni(0) * 4 + 1);
    prefix = substr(phone,1,len);
    if input(phone,best.) ge 1000000000 then do;
      output;
    end;
  end;

run;

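Assuming the longest prefix is 4 characters, try to find the longest match first, then keep going until the shortest prefix has been tried. If a match is found, output the record and move on to the next observation.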

data ht;
  attrib prefix length=$4;

  set phone_numbers;

  if _n_ eq 1 then do;
    declare hash ht(dataset:"prefixes");
    ht.defineKey('prefix');
    ht.defineDone();
  end;

  do len=4 to 1 by -1;
    prefix = substr(phone,1,len);
    if ht.find() eq 0 then do;
      output;
      leave;
    end;
  end;

  drop len;

run;

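You may also want to add logic that outputs the record with the prefix field left blank when no match is found. I am not sure how you want to handle that case.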
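
One possible way to keep unmatched records, sketched under the same assumptions as the step above (a `phone_numbers` table with a character `phone` variable, a `prefixes` table with a `prefix` variable, and a maximum prefix length of 4). The output name `ht_all` and the helper variable `found` are hypothetical; `check()` is used here instead of `find()` because only the existence of the key matters.

```sas
data ht_all;
  attrib prefix length=$4;

  set phone_numbers;

  if _n_ eq 1 then do;
    declare hash ht(dataset:"prefixes");
    ht.defineKey('prefix');
    ht.defineDone();
  end;

  /* Longest candidate first; stop as soon as one is found in the hash */
  found = 0;
  do len = 4 to 1 by -1 until (found);
    prefix = substr(phone, 1, len);
    found = (ht.check() eq 0);
  end;

  /* No match: blank the prefix but still keep the row */
  if not found then call missing(prefix);

  drop len found;
  /* No explicit OUTPUT statement, so every row is written at the end of the step */
run;
```

Dropping the explicit `output` is the design choice that matters here: every phone number reaches the output table, and a blank `prefix` marks the rows that had no match.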
