Matching partial words in two different columns

Time: 2016-07-21 18:35:34

Tags: sql google-bigquery

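I'm trying to weed out a certain kind of customer from our database. I've noticed a trend where people fill in their first name with a word that partially matches what they entered as their business name. An example looks like: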

business_name               first_name
-------------               ----------
locksmith taylorsville      locksmith
locksmith roy               locksmi
locksmith clinton           locks
locksmith farmington        locksmith

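These are the people I don't want returned by my queries. They are the bad eggs. I'm trying to put together a query with a WHERE clause that (presumably) isolates anyone whose first name is at least a partial match for their business name, but I'm stumped and could use some help.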

3 answers:

Answer 0: (score: 1)

You can use the LIKE operator, in which % stands for any sequence of characters.
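As a rough sketch of how such a filter could look, assuming the GoogleSQL (standard SQL) dialect of BigQuery: the `customers` CTE below is only a stand-in for the real table, and its last row is a made-up example that should not match.

```sql
-- Assumptions: GoogleSQL (standard SQL) dialect; the `customers` CTE stands in
-- for the real table, and the last row is a made-up non-matching example.
WITH customers AS (
  SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name UNION ALL
  SELECT 'locksmith roy', 'locksmi' UNION ALL
  SELECT 'locksmith clinton', 'locks' UNION ALL
  SELECT 'locksmith farmington', 'locksmith' UNION ALL
  SELECT 'acme plumbing', 'dana'
)
SELECT business_name, first_name
FROM customers
-- '%' on both sides lets first_name match anywhere inside business_name
WHERE LOWER(business_name) LIKE CONCAT('%', LOWER(first_name), '%');
```

Dropping the leading '%' would restrict the match to business names that start with the first name. Note that any % or _ characters inside first_name are themselves treated as wildcards by LIKE.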

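Answer 1: (score: 0)

You can take a similarity-based approach. Try the code at the bottom of this answer.
It produces results like the following: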

business_name           partial_business_name   first_name  similarity   
locksmith taylorsville  locksmith               locksmith   1.0  
locksmith farmington    locksmith               locksmith   1.0  
locksmith roy           locksmith               locksmi     0.7777777777777778   
locksmith clinton       locksmith               locks       0.5555555555555556   

So you will be able to control what gets filtered out based on the similarity value.
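The values in the table are consistent with one minus the Levenshtein distance divided by the length of the first word of the business name, which is what the emit call in the code computes. As a worked check, writing d for the Levenshtein distance (notation introduced here, not taken from the answer):

```latex
\[
\text{similarity}
  = 1 - \frac{d(\text{partial\_business\_name},\ \text{first\_name})}
             {\lvert \text{partial\_business\_name} \rvert}
\]
\[
d(\text{locksmith}, \text{locksmi}) = 2 \;\Rightarrow\; 1 - \tfrac{2}{9} \approx 0.778,
\qquad
d(\text{locksmith}, \text{locks}) = 4 \;\Rightarrow\; 1 - \tfrac{4}{9} \approx 0.556
\]
```

Because the denominator is the length of the business word only, a first name much longer than that word can push the value below zero.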

**Code**

SELECT business_name, partial_business_name, first_name, similarity FROM 
JS( // input table
(
  SELECT business_name, REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name, first_name AS first_name FROM 
    (SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name),
    (SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name),
    (SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name),
    (SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name),
) ,
// input columns
business_name, partial_business_name, first_name,
// output schema
"[{name: 'business_name', type:'string'},
  {name: 'partial_business_name', type:'string'},
  {name: 'first_name', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  // declare both working variables so neither leaks into the global scope
  var the_partial_business_name, the_first_name;

  try {
    the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase();
  } catch (ex) {
    the_partial_business_name = r.partial_business_name.toLowerCase();
  }

  try {
    the_first_name = decodeURI(r.first_name).toLowerCase();
  } catch (ex) {
    the_first_name = r.first_name.toLowerCase();
  }

  emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name,
        similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});

  }"
)
ORDER BY similarity DESC

This was used in How to perform trigram operations in Google BigQuery? and is based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark, where Levenshtein distance is used to measure similarity.
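Once a similarity column exists, the filtering itself comes down to a threshold. The sketch below is not the answer's method: it assumes the GoogleSQL (standard SQL) dialect and its built-in EDIT_DISTANCE function in place of the JavaScript UDF, and the 0.5 cutoff is an arbitrary value chosen only for illustration.

```sql
-- Assumptions: GoogleSQL (standard SQL) dialect with the built-in EDIT_DISTANCE
-- function; the 0.5 cutoff is an arbitrary illustrative threshold.
WITH customers AS (
  SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name UNION ALL
  SELECT 'locksmith roy', 'locksmi' UNION ALL
  SELECT 'locksmith clinton', 'locks' UNION ALL
  SELECT 'locksmith farmington', 'locksmith'
),
words AS (
  SELECT
    business_name,
    first_name,
    -- first word of the business name, lowercased
    REGEXP_EXTRACT(LOWER(business_name), r'^(\w+)') AS partial_business_name
  FROM customers
),
scored AS (
  SELECT
    business_name,
    partial_business_name,
    first_name,
    -- same formula as the UDF: 1 - distance / length of the business word
    1 - EDIT_DISTANCE(partial_business_name, LOWER(first_name))
        / LENGTH(partial_business_name) AS similarity
  FROM words
)
SELECT business_name, partial_business_name, first_name, similarity
FROM scored
WHERE similarity >= 0.5
ORDER BY similarity DESC;
```

Flipping the comparison to `similarity < 0.5` would instead keep only the rows that do not look like this pattern.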

Answer 2: (score: 0)

This should do the trick:

SELECT * FROM TableName WHERE lower(business_name) CONTAINS lower(first_name)


Use lower() in case the values contain capital letters. Hope it helps.
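The query above selects the suspicious rows. Since the goal in the question is to keep them out of results, the condition can be negated. A sketch under assumptions: it uses the GoogleSQL (standard SQL) dialect, where STRPOS plays the role of the containment test (CONTAINS as written above is taken here to be legacy SQL syntax), and the `customers` CTE with its second row is made up for illustration.

```sql
-- Assumptions: GoogleSQL (standard SQL) dialect; `customers` is a stand-in CTE
-- and its last row is a made-up example of a legitimate customer.
WITH customers AS (
  SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name UNION ALL
  SELECT 'acme plumbing', 'Dana'
)
SELECT business_name, first_name
FROM customers
WHERE first_name IS NULL
   OR business_name IS NULL
   -- STRPOS returns 0 when the second string does not occur in the first
   OR STRPOS(LOWER(business_name), LOWER(first_name)) = 0;
```

The explicit NULL checks are a design choice: without them, rows with a missing name would be dropped silently, because a NULL comparison is never true in a WHERE clause.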