Is there a way to measure string similarity in Google BigQuery?

Date: 2015-10-30 10:34:56

Tags: javascript regex google-bigquery udf

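I'm wondering whether anyone knows of a way to measure string similarity in BigQuery.

It seems like it would be a neat function to have.

My case is that I need to compare the similarity of two URLs, because I want to be fairly sure they refer to the same article.

I can find examples using javascript, so maybe a UDF is the way to go, but I have not used UDFs (or javascript) at all :)

Just wondering whether there is a way to do this with the existing regex functions, or whether anyone could get me started on porting the javascript example into a UDF.

Any help much appreciated, thanks.

EDIT: Adding some example code

So if I define a UDF as: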

// distance function

function levenshteinDistance (row, emit) {

  //if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
  if (typeof row.inputA === 'undefined') {var myresult = 1};
  if (typeof row.inputB === 'undefined') {var myresult = 1};
  //if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};

    var myresult = Math.min(
        levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
        levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
        levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
    ) + 1;

  emit({outputA: myresult})

}

bigquery.defineFunction(
  'levenshteinDistance',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  levenshteinDistance                       // Reference to JavaScript UDF
);

// make a test function to test individual parts

function test(row, emit) {
  if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
  emit({outputA: x});
}

bigquery.defineFunction(
  'test',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  test                       // Reference to JavaScript UDF
);

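I try to test it with a query like this:

SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))

I get the error:

Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39
Error Location: User-defined function

It seems that row.inputA may not be a string, or that for some reason the string functions cannot work on it. I am not sure whether this is a type issue or something odd about which utilities UDFs can use by default.

Again, any help much appreciated, thanks.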
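One likely reading of this error, offered as a sketch rather than a confirmed diagnosis: the recursive calls inside Math.min pass two plain strings, while the function expects a single row object, so on the second call row.inputA is undefined. The base-case checks also assign myresult without returning, so they would not stop the recursion either. The function `dropFirstChar` below is a hypothetical stand-in used only to reproduce the symptom outside BigQuery.

```javascript
// Hypothetical stand-in for the UDF above: like it, this expects a row object.
function dropFirstChar(row) {
  return row.inputA.substr(1);
}

// Called the way BigQuery calls the UDF, with a row object, it works.
const fromRow = dropFirstChar({ inputA: 'abc', inputB: 'abd' });

// Called the way the recursion calls it, with two strings, `row` is the
// string 'bc', so `row.inputA` is undefined and `.substr` throws.
let recursionError = null;
try {
  dropFirstChar('bc', 'abd');
} catch (err) {
  recursionError = err;
}

console.log(fromRow);                             // bc
console.log(recursionError instanceof TypeError); // true
```

Keeping the string-to-string distance in its own helper, and letting only a thin wrapper touch the row object, would avoid this mix-up.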

8 Answers:

Answer 0: (score: 3)

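Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by simply calculating abs((strlen - distance) / strlen).

The easiest way to achieve this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.

To invoke it, you would then pass in the two URLs as columns aliased to 'a' and 'b':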

SELECT a, b, distance
FROM
  Levenshtein(
     SELECT
       some_url AS a, other_url AS b
     FROM
       your_table
  )
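A minimal sketch of what such a UDF could look like, registered in the same legacy bigquery.defineFunction style the question uses. The helper names `levenshtein`, `similarity` and `levenshteinUdf`, the handling of NULL columns as empty strings, and the choice of the longer string's length as the denominator are assumptions made for illustration, not part of the original answer.

```javascript
// Plain string-to-string edit distance (iterative, two rows of the DP table).
function levenshtein(a, b) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j); // distance from '' to b[0..j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from a[0..i) to ''
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // delete a[i-1]
        curr[j - 1] + 1,    // insert b[j-1]
        prev[j - 1] + cost  // substitute, or keep when equal
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalisation: 1 means identical, 0 means every position differs.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Row wrapper: only this function touches the row object.
// Assumption: NULL columns are treated as empty strings.
function levenshteinUdf(row, emit) {
  const a = row.a == null ? '' : String(row.a);
  const b = row.b == null ? '' : String(row.b);
  emit({ a: row.a, b: row.b, distance: levenshtein(a, b) });
}

// `bigquery` only exists inside BigQuery's UDF environment, so guard the call.
if (typeof bigquery !== 'undefined') {
  bigquery.defineFunction(
    'Levenshtein',
    ['a', 'b'],
    [{ name: 'a', type: 'string' },
     { name: 'b', type: 'string' },
     { name: 'distance', type: 'integer' }],
    levenshteinUdf
  );
}

console.log(levenshtein('abc', 'abd')); // 1
```

The iterative form is used here because the naive recursive definition recomputes the same sub-problems and becomes very slow on strings as long as typical URLs.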

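Answer 1: (score: 3)

If you are familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery by loading an external library from GCS.

Steps:

  1. Download the javascript version of fuzzywuzzy (fuzzball)
  2. Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to a clearer name (e.g. fuzzball)
  3. Upload it to a Google Cloud Storage bucket
  4. Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path)

CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)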
RETURNS FLOAT64
LANGUAGE js AS """
  return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
  library="gs://my-bucket/fuzzball.js");

with data as (select "my_test_string" as a, "my_other_string" as b)

SELECT  a, b, token_set_ratio(a, b) from data

Answer 2: (score: 2)

Ready-to-use shared UDFs. Levenshtein distance:

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
 , fhoffa.x.levenshtein('googgle', 'goggles')
 , fhoffa.x.levenshtein('is this the', 'Is This The')

6  2  0

Soundex:

SELECT fhoffa.x.soundex('felipe')
 , fhoffa.x.soundex('googgle')
 , fhoffa.x.soundex('guugle')

F410  G240  G240
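For readers unfamiliar with the codes above, here is a minimal local sketch of American Soundex. It is an independent illustration and not the source of fhoffa.x.soundex; the function name `soundex` and the decision to drop non-letters are assumptions made for this example.

```javascript
// Consonant groups used by American Soundex; vowels, h, w and y have no code.
const SOUNDEX_CODES = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6
};

function soundex(word) {
  // Assumption: anything that is not a-z is ignored.
  const letters = word.toLowerCase().replace(/[^a-z]/g, '');
  if (letters.length === 0) return '';

  let result = letters[0].toUpperCase();       // first letter is kept as-is
  let lastCode = SOUNDEX_CODES[letters[0]] || 0;

  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters[i];
    const code = SOUNDEX_CODES[ch] || 0;
    // Emit a digit only for a coded letter that differs from the previous code.
    if (code !== 0 && code !== lastCode) result += code;
    // h and w do not separate letters with the same code; vowels do.
    if (ch !== 'h' && ch !== 'w') lastCode = code;
  }
  return result.padEnd(4, '0');                // pad short codes with zeros
}

console.log(soundex('felipe'), soundex('googgle'), soundex('guugle'));
// F410 G240 G240
```

Because 'googgle' and 'guugle' differ only in vowels and a doubled consonant, they collapse to the same code, which is what makes Soundex useful for matching misspellings that sound alike.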

Fuzzy choose one:

SELECT fhoffa.x.fuzzy_extract_one('jony' 
  , (SELECT ARRAY_AGG(name) 
   FROM `fh-bigquery.popular_names.gender_probabilities`) 
  #, ['john', 'johnny', 'jonathan', 'jonas']
)

johnny

How-to:

Answer 3: (score: 1)

I couldn't find a direct answer to this, so I propose this solution, in standard SQL:

#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
  (
  SELECT
    SUM(counter) AS diff
  FROM (
    SELECT
      CASE
        WHEN X.value != Y.value THEN 1
        ELSE 0
      END AS counter
    FROM (
      SELECT
        value,
        ROW_NUMBER() OVER() AS row
      FROM
        UNNEST(SPLIT(a, "")) AS value ) X
    JOIN (
      SELECT
        value,
        ROW_NUMBER() OVER() AS row
      FROM
        UNNEST(SPLIT(b, "")) AS value ) Y
    ON
      X.row = Y.row )
   )
);

WITH Input AS (
  SELECT 'abcdef' AS strings UNION ALL
  SELECT 'defdef' UNION ALL
  SELECT '1bcdef' UNION ALL
  SELECT '1bcde4' UNION ALL
  SELECT '123de4' UNION ALL
  SELECT 'abc123'
)

SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;

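In contrast to other solutions (like this one), it takes two strings (of the same length, following the definition of Hamming distance) and outputs the expected distance.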
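To sanity-check the SQL output locally, the same definition can be sketched in a few lines of JavaScript. The function name `hammingDistance` and the choice to throw on unequal lengths are assumptions for this example, not something the SQL version does.

```javascript
// Number of positions at which two equal-length strings differ.
function hammingDistance(a, b) {
  // Array.from splits by Unicode code point rather than UTF-16 unit.
  const x = Array.from(a);
  const y = Array.from(b);
  if (x.length !== y.length) {
    throw new RangeError('Hamming distance needs strings of equal length');
  }
  let diff = 0;
  for (let i = 0; i < x.length; i++) {
    if (x[i] !== y[i]) diff++;
  }
  return diff;
}

// Same inputs as the query above, compared against 'abcdef'.
const inputs = ['abcdef', 'defdef', '1bcdef', '1bcde4', '123de4', 'abc123'];
const distances = inputs.map((s) => hammingDistance('abcdef', s));
console.log(distances.join(' ')); // 0 3 1 2 4 3
```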

Answer 4: (score: 0)

Below is a fairly simple version of Hamming distance that uses WITH OFFSET instead of ROW_NUMBER() OVER():

#standardSQL
WITH Input AS (
  SELECT 'abcdef' AS strings UNION ALL
  SELECT 'defdef' UNION ALL
  SELECT '1bcdef' UNION ALL
  SELECT '1bcde4' UNION ALL
  SELECT '123de4' UNION ALL
  SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings, 
  (SELECT COUNT(1) 
    FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
    JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
    ON x = y AND a != b) hamming_distance
FROM Input

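Answer 5: (score: 0)

Try Flookup for Google Sheets. It is definitely faster than Levenshtein distance, and it calculates percentage similarities out of the box. One Flookup function you might find useful is:

FUZZYMATCH (string1, string2)

Parameter details

  1. string1: compared with string2.
  2. string2: compared with string1.

A percentage similarity is then calculated from these comparisons. Both parameters can be ranges.

I am currently trying to optimise it for large data sets, so your feedback is very welcome.

Edit: I am the creator of Flookup.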

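Answer 6: (score: 0)

While looking for something like Felipe's answer above, I worked on my own query and ended up with two versions, one that I call string approximation and another that I call string resemblance.

The first looks at the shortest distance between letters of the source string and the test string, and returns a score between 0 and 1, where 1 is a complete match. It always scores based on the longer of the two strings. It turns out to return results similar to the Levenshtein distance.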

#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
                              select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref, 
                              case 
                                when min(result) is null then 0
                                else 1 / (min(result) + 1) 
                              end as best_result,
                              from (
                                       select *,
                                              if(source = test, abs(sourceoffset - (testoffset)),
                                              greatest(length(testString),length(sourceString))) as result
                                       from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                                cross join
                                            (select *
                                             from unnest(split(lower(testString),'')) as test with offset as testoffset)
                                       ) as results
                              group  by ref
                                 )
        )
);

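The second is a variation of the first. It looks at sequences of matching distances, so that a character that matches at the same distance as the character before or after it counts as one point. It works quite well, better than string approximation, but not quite as well as I would like (see the example output below).

#standardSQL
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (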
(
select avg(sequence)
from (
      select ref,
             if(array_length(array(select * from comparison.collection intersect distinct
                                   (select * from comparison.before))) > 0
                    or array_length(array(select * from comparison.collection intersect distinct
                                          (select * from comparison.after))) > 0
                 , 1, 0) as sequence

      from (
               select ref,
                      collection,
                      lag(collection) over (order by ref)  as before,
                      lead(collection) over (order by ref) as after
               from (
                     select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
                            array_agg(result ignore nulls)                                          as collection
                     from (
                              select *,
                                     if(source = test, abs(sourceoffset - (testoffset)), null) as result
                              from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                       cross join
                                   (select *
                                    from unnest(split(lower(testString),'')) as test with offset as testoffset)
                              ) as results
                     group by ref
                        )
               ) as comparison
      )

)
);

Now here is a sample of the results:

#standardSQL
with test_subjects as (
  select 'benji' as name union all
  select 'benjamin' union all
  select 'benjamin alan artis' union all
  select 'ben artis' union all
  select 'artis benjamin' 
)

select name, quick.stringApproximation('benjamin artis', name) as approximation, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects

order by resemblance desc

This returns:

+---------------------+--------------------+--------------------+
| name                | approximation      | resemblance        |
+---------------------+--------------------+--------------------+
| artis benjamin      | 0.2653061224489796 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| benjamin alan artis | 0.6078947368421053 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| ben artis           | 0.4142857142857142 | 0.7142857142857143 |
+---------------------+--------------------+--------------------+
| benjamin            | 0.6125850340136053 | 0.5714285714285714 |
+---------------------+--------------------+--------------------+
| benji               | 0.36269841269841263| 0.28571428571428575|
+---------------------+--------------------+--------------------+

Edited: updated the resemblance algorithm to improve the results.

Answer 7: (score: 0)

I did it like this:

CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
  (
    WITH a_trigrams AS (
      SELECT
        DISTINCT tri_a
      FROM
        unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
    ),
    b_trigrams AS (
      SELECT
        DISTINCT tri_b
      FROM
        unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
    )
    SELECT
      COUNTIF(tri_b IS NOT NULL) / COUNT(*)
    FROM
      a_trigrams
      LEFT JOIN b_trigrams ON tri_a = tri_b
  )
);

Here is a comparison with Postgres's pg_trgm:

select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727

select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
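The logic of the SQL function can be mirrored in a short JavaScript sketch, which makes it easier to see where the numbers above come from. The names `trigrams` and `trigramSimilarity` and the NaN result for inputs shorter than three characters are assumptions for this example.

```javascript
// Distinct lowercase character trigrams of a string.
function trigrams(s) {
  const chars = Array.from(s.toLowerCase());
  const out = new Set();
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.add(chars.slice(i, i + 3).join(''));
  }
  return out;
}

// Share of a's distinct trigrams that also occur in b.
// Like the SQL above, this is not symmetric in a and b.
function trigramSimilarity(a, b) {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0) return NaN; // no trigrams: the ratio is undefined
  let shared = 0;
  for (const t of ta) {
    if (tb.has(t)) shared++;
  }
  return shared / ta.size;
}

console.log(trigramSimilarity('saemus', 'seamus')); // 0.25
console.log(trigramSimilarity('shamus', 'seamus')); // 0.5
```

'saemus' has four distinct trigrams (sae, aem, emu, mus) and only mus appears in 'seamus', hence 0.25. The pg_trgm figures differ because pg_trgm pads each word with blanks before extracting trigrams and divides the shared count by the size of the union of both trigram sets.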

I gave the same answer on How to perform trigram operations in Google BigQuery?