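REGEXP_REPLACE pattern has to be const? Comparing strings in BigQuery

Time: 2016-03-22 02:05:21

Tags: google-bigquery

I'm trying to measure similarity between strings in BigQuery using Dice's Coefficient (also known as Pair Similarity). For a second I thought I could do that with standard functions alone.

Suppose I need to compare "gana" and "gano". I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this: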

REGEXP_REPLACE('ga|an|na', 'ga|an|no', '')

Then, based on the change in length, I can calculate my coefficient.
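To make the length-based idea concrete, here is a minimal JavaScript sketch of the same trick outside BigQuery. The helper names (`bigrams`, `escapeRegex`, `diceViaRegex`) are hypothetical and not part of the question. It assumes the words contain no '|' character, and it counts every occurrence in the first word, so words with repeated 2-grams may give an asymmetric result.

```javascript
// Split a word into its consecutive 2-grams, e.g. 'gana' -> ['ga', 'an', 'na'].
function bigrams(word) {
  const grams = [];
  for (let i = 0; i + 1 < word.length; i++) grams.push(word.slice(i, i + 2));
  return grams;
}

// Escape regex metacharacters so each 2-gram is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Dice's coefficient derived from how much shorter the first list becomes
// after removing every 2-gram that also occurs in the second word.
function diceViaRegex(word1, word2) {
  const grams1 = bigrams(word1);
  const grams2 = bigrams(word2);
  if (grams1.length === 0 || grams2.length === 0) return 0;

  const subject = grams1.join('|');                      // 'ga|an|na'
  const pattern = new RegExp(grams2.map(escapeRegex).join('|'), 'g'); // /ga|an|no/g
  const remainder = subject.replace(pattern, '');        // '||na'

  // Each removed 2-gram shortens the string by exactly two characters.
  const shared = (subject.length - remainder.length) / 2;
  return (2 * shared) / (grams1.length + grams2.length);
}

console.log(diceViaRegex('gana', 'gano'));
```

For "gana" and "gano" two of the three 2-grams are removed, so the coefficient is 2 * 2 / (3 + 3).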

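But once I apply this to a table, I get:

REGEXP_REPLACE second argument must be const and non-null

Is there any workaround for this? With a simple REPLACE(), the second argument can be a field.

Maybe there is a better way to do it? I know I could write a UDF instead, but I wanted to avoid UDFs here. We are running big tasks, and UDFs are generally slower (at least in my experience) and are subject to different concurrency limits.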

3 Answers:

Answer 0: (score: 1)

You can have JavaScript code inside BigQuery SQL queries.

To measure similarity you could use Levenshtein's distance with a query like this (from https://stackoverflow.com/a/33443564/132438):

SELECT *
FROM js(
(
  SELECT title,target FROM
   (SELECT 'hola' title, 'hello' target), (SELECT 'this is beautiful' title, 'that is fantastic' target) 
),
  title, target,
  // Output schema.
  "[{name: 'title', type:'string'},
    {name: 'target', type:'string'},
    {name: 'distance', type:'integer'}]",
  // The function
  "function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_title;

  try {
    the_title = decodeURI(r.title).toLowerCase();
  } catch (ex) {
    the_title = r.title.toLowerCase();
  }

  emit({title: the_title, target: r.target,
        distance: Levenshtein.get(the_title, r.target)});

  }")

Answer 1: (score: 1)

Below is a version tailored for similarity. It was used in "How to perform trigram operations in Google BigQuery?" and is based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark.

SELECT text1, text2, similarity FROM 
JS(
// input table
(
  SELECT * FROM 
  (SELECT 'mikhail' AS text1, 'mikhail' AS text2),
  (SELECT 'mikhail' AS text1, 'mike' AS text2),
  (SELECT 'mikhail' AS text1, 'michael' AS text2),
  (SELECT 'mikhail' AS text1, 'javier' AS text2),
  (SELECT 'mikhail' AS text1, 'thomas' AS text2)
) ,
// input columns
text1, text2,
// output schema
"[{name: 'text1', type:'string'},
  {name: 'text2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_text1, the_text2;

  try {
    the_text1 = decodeURI(r.text1).toLowerCase();
  } catch (ex) {
    the_text1 = r.text1.toLowerCase();
  }

  try {
    the_text2 = decodeURI(r.text2).toLowerCase();
  } catch (ex) {
    the_text2 = r.text2.toLowerCase();
  }

  emit({text1: the_text1, text2: the_text2,
        similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});

  }"
)
ORDER BY similarity DESC
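
The query above normalizes the distance by the length of text1 only, so the score can drop below zero when text2 is much longer. As a rough sketch for checking this outside BigQuery, the standalone JavaScript below uses a standard two-row Levenshtein routine and compares that normalization with one that divides by the longer length. The function names are hypothetical, `similarityByMax` is an alternative and not what the answer uses, and the lowercasing and decodeURI steps of the UDF are left out.

```javascript
// Standard two-row Levenshtein distance.
function levenshtein(s, t) {
  if (s === t) return 0;
  if (s.length === 0) return t.length;
  if (t.length === 0) return s.length;

  // prev[j] holds the distance between the first i chars of s and first j chars of t.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 0; i < s.length; i++) {
    const cur = [i + 1];
    for (let j = 0; j < t.length; j++) {
      const cost = s[i] === t[j] ? 0 : 1;
      cur.push(Math.min(
        prev[j] + cost,   // substitution (or match)
        cur[j] + 1,       // insertion
        prev[j + 1] + 1   // deletion
      ));
    }
    prev = cur;
  }
  return prev[t.length];
}

// Normalization used in the query above: divide by the first string's length.
function similarityByFirst(a, b) {
  return 1 - levenshtein(a, b) / a.length;
}

// Alternative (an assumption, not from the answer): divide by the longer
// length, which keeps the score between 0 and 1.
function similarityByMax(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

console.log(similarityByFirst('ab', 'abcdef'), similarityByMax('ab', 'abcdef'));
```

Which normalization fits depends on whether text1 is treated as the reference string or the two inputs are meant to be interchangeable.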

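Answer 2: (score: 0)

"REGEXP_REPLACE second argument must be const and non-null. Is there any workaround for this?"

Below is just an idea/direction for addressing the question above, applied to the logic you described:

"I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this: REGEXP_REPLACE('ga|an|na', 'ga|an|no', ''). Then, based on the change in length, I can calculate my coefficient."

The "workaround" is: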

SELECT a.w AS w1, b.w AS w2, SUM(a.x = b.x) / COUNT(1) AS c
FROM (
  SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
  FROM 
    (SELECT 'gana' AS w, 'ga|an|na' AS p)
) AS a
JOIN (
  SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
  FROM 
    (SELECT 'gano' AS w, 'ga|an|no' AS p),
    (SELECT 'gamo' AS w, 'ga|am|mo' AS p),
    (SELECT 'kana' AS w, 'ka|an|na' AS p)
) AS b
ON a.pos = b.pos
GROUP BY w1, w2  
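
As a way to sanity-check what this query computes, here is a small JavaScript sketch of the same logic: split both lists on '|', pair the 2-grams up by position, and take the fraction of positions that match. The name `positionalMatch` is hypothetical, and the sketch assumes the SQL numbers the split values in their original order, which ROW_NUMBER() without an ORDER BY does not strictly guarantee.

```javascript
// Fraction of positions at which two '|'-separated 2-gram lists agree.
function positionalMatch(list1, list2) {
  const grams1 = list1.split('|');
  const grams2 = list2.split('|');

  // The JOIN on pos keeps only positions present on both sides.
  const positions = Math.min(grams1.length, grams2.length);

  let same = 0;
  for (let i = 0; i < positions; i++) {
    if (grams1[i] === grams2[i]) same++;
  }
  return same / positions;
}

console.log(positionalMatch('ga|an|na', 'ga|an|no'));
console.log(positionalMatch('ga|an|na', 'ga|am|mo'));
console.log(positionalMatch('ga|an|na', 'ka|an|na'));
```

Note that this compares 2-grams position by position, so a shared 2-gram that sits at a different offset in the two words is not counted.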
  

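"Maybe there is a better way to do it?"

Below is a simple example of how pair similarity can be approached here (including building the bigram sets and calculating the coefficient):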

SELECT
  a.word AS word1, b.word AS word2, 
  2 * SUM(a.bigram = b.bigram) / 
    (EXACT_COUNT_DISTINCT(a.bigram) + EXACT_COUNT_DISTINCT(b.bigram) ) AS c
FROM (
  SELECT word, char + next_char AS bigram
  FROM (
    SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
    FROM (
      SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
      FROM
        (SELECT 'gana' AS word)
    )
  )
  WHERE next_char IS NOT NULL
  GROUP BY 1, 2
) a
CROSS JOIN (
  SELECT word, char + next_char AS bigram
  FROM (
    SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
    FROM (
      SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
      FROM
        (SELECT 'gano' AS word)
    )
  )
  WHERE next_char IS NOT NULL
  GROUP BY 1, 2
) b
GROUP BY 1, 2
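
For comparison, the coefficient this query computes (twice the number of shared distinct bigrams, divided by the sum of the two distinct bigram counts) can be sketched in plain JavaScript as below. The helper names `bigramSet` and `dice` are hypothetical, and returning 0 for words shorter than two characters is an assumption of this sketch, not something the query defines.

```javascript
// Distinct 2-grams of a word, e.g. 'gana' -> Set {'ga', 'an', 'na'}.
function bigramSet(word) {
  const grams = new Set();
  for (let i = 0; i + 1 < word.length; i++) grams.add(word.slice(i, i + 2));
  return grams;
}

// Dice's coefficient over distinct bigrams: 2 * |A ∩ B| / (|A| + |B|).
function dice(word1, word2) {
  const a = bigramSet(word1);
  const b = bigramSet(word2);
  if (a.size + b.size === 0) return 0; // assumption: no bigrams means no similarity

  let shared = 0;
  for (const gram of a) {
    if (b.has(gram)) shared++;
  }
  return (2 * shared) / (a.size + b.size);
}

console.log(dice('gana', 'gano'));
```

Unlike the positional comparison, this treats each word as a set of bigrams, so the order in which the bigrams occur does not affect the score.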