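I'm trying to measure similarity between strings in BigQuery using Dice's Coefficient (aka Pair Similarity). For a moment I thought I could do this with standard functions alone.
Suppose I need to compare "gana" and "gano". I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this: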
REGEXP_REPLACE('ga|an|na', 'ga|an|no', '')
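Then, based on the change in length, I can calculate my coefficient.
But once this is applied to a table (so the pattern comes from a column rather than a literal), I get:
REGEXP_REPLACE second argument must be const and non-null
Is there any workaround? With plain REPLACE() the second argument can be a field.
Maybe there is an even better way? I know I could write a UDF, but I'd like to avoid them here. We are running big tasks, and UDFs are generally slower (at least in my experience) and subject to different concurrency limits.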
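For reference, the length-change idea described above can be sketched outside BigQuery in plain JavaScript. This is only an illustration of the arithmetic, not something BigQuery runs as-is: `bigramList`, `escapeRegex` and `diceByLengthChange` are hypothetical helper names, and the sketch assumes the words contain no '|' character.

```javascript
// Hypothetical helper: builds the '|'-joined list of 2-grams, e.g. 'gana' -> 'ga|an|na'.
function bigramList(word) {
  var grams = [];
  for (var i = 0; i < word.length - 1; i++) {
    grams.push(word.substring(i, i + 2));
  }
  return grams.join('|');
}

// Escape regex metacharacters so every 2-gram is matched literally.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Dice's coefficient derived from how much shorter list1 gets
// once every 2-gram of word2 has been removed from it.
function diceByLengthChange(word1, word2) {
  var list1 = bigramList(word1);
  var list2 = bigramList(word2);
  if (list1 === '' || list2 === '') return 0; // words shorter than 2 characters
  var count1 = list1.split('|').length;
  var count2 = list2.split('|').length;
  var pattern = new RegExp(list2.split('|').map(escapeRegex).join('|'), 'g');
  var stripped = list1.replace(pattern, '');
  // Each removed 2-gram shortens the string by exactly 2 characters.
  var shared = (list1.length - stripped.length) / 2;
  return 2 * shared / (count1 + count2);
}

console.log(diceByLengthChange('gana', 'gano')); // 2 * 2 / (3 + 3), about 0.667
```

For 'gana' and 'gano', 'ga|an|na' shrinks to '||na' (8 characters down to 4), so 2 bigrams are shared. One caveat of this shortcut: a bigram that repeats within the first word is removed every time it occurs, so the result can overcount compared with a set-based definition.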
Answer 0 (score: 1)
You can embed JavaScript code in BigQuery SQL queries.
To measure similarity, you could use the Levenshtein distance with a query like this (from https://stackoverflow.com/a/33443564/132438):
SELECT *
FROM js(
(
SELECT title,target FROM
(SELECT 'hola' title, 'hello' target), (SELECT 'this is beautiful' title, 'that is fantastic' target)
),
title, target,
// Output schema.
"[{name: 'title', type:'string'},
{name: 'target', type:'string'},
{name: 'distance', type:'integer'}]",
// The function
"function(r, emit) {
var _extend = function(dst) {
var sources = Array.prototype.slice.call(arguments, 1);
for (var i=0; i<sources.length; ++i) {
var src = sources[i];
for (var p in src) {
if (src.hasOwnProperty(p)) dst[p] = src[p];
}
}
return dst;
};
var Levenshtein = {
/**
* Calculate levenshtein distance of the two strings.
*
* @param str1 String the first string.
* @param str2 String the second string.
* @return Integer the levenshtein distance (0 and above).
*/
get: function(str1, str2) {
// base cases
if (str1 === str2) return 0;
if (str1.length === 0) return str2.length;
if (str2.length === 0) return str1.length;
// two rows
var prevRow = new Array(str2.length + 1),
curCol, nextCol, i, j, tmp;
// initialise previous row
for (i=0; i<prevRow.length; ++i) {
prevRow[i] = i;
}
// calculate current row distance from previous row
for (i=0; i<str1.length; ++i) {
nextCol = i + 1;
for (j=0; j<str2.length; ++j) {
curCol = nextCol;
// substitution
nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
// insertion
tmp = curCol + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// deletion
tmp = prevRow[j + 1] + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// copy current col value into previous (in preparation for next iteration)
prevRow[j] = curCol;
}
// copy last col value into previous (in preparation for next iteration)
prevRow[j] = nextCol;
}
return nextCol;
}
};
var the_title;
try {
the_title = decodeURI(r.title).toLowerCase();
} catch (ex) {
the_title = r.title.toLowerCase();
}
emit({title: the_title, target: r.target,
distance: Levenshtein.get(the_title, r.target)});
}")
Answer 1 (score: 1)
The version below is tailored for similarity. It was used in How to perform trigram operations in Google BigQuery? and is based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark
SELECT text1, text2, similarity FROM
JS(
// input table
(
SELECT * FROM
(SELECT 'mikhail' AS text1, 'mikhail' AS text2),
(SELECT 'mikhail' AS text1, 'mike' AS text2),
(SELECT 'mikhail' AS text1, 'michael' AS text2),
(SELECT 'mikhail' AS text1, 'javier' AS text2),
(SELECT 'mikhail' AS text1, 'thomas' AS text2)
) ,
// input columns
text1, text2,
// output schema
"[{name: 'text1', type:'string'},
{name: 'text2', type:'string'},
{name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {
var _extend = function(dst) {
var sources = Array.prototype.slice.call(arguments, 1);
for (var i=0; i<sources.length; ++i) {
var src = sources[i];
for (var p in src) {
if (src.hasOwnProperty(p)) dst[p] = src[p];
}
}
return dst;
};
var Levenshtein = {
/**
* Calculate levenshtein distance of the two strings.
*
* @param str1 String the first string.
* @param str2 String the second string.
* @return Integer the levenshtein distance (0 and above).
*/
get: function(str1, str2) {
// base cases
if (str1 === str2) return 0;
if (str1.length === 0) return str2.length;
if (str2.length === 0) return str1.length;
// two rows
var prevRow = new Array(str2.length + 1),
curCol, nextCol, i, j, tmp;
// initialise previous row
for (i=0; i<prevRow.length; ++i) {
prevRow[i] = i;
}
// calculate current row distance from previous row
for (i=0; i<str1.length; ++i) {
nextCol = i + 1;
for (j=0; j<str2.length; ++j) {
curCol = nextCol;
// substitution
nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
// insertion
tmp = curCol + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// deletion
tmp = prevRow[j + 1] + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// copy current col value into previous (in preparation for next iteration)
prevRow[j] = curCol;
}
// copy last col value into previous (in preparation for next iteration)
prevRow[j] = nextCol;
}
return nextCol;
}
};
var the_text1, the_text2;
try {
the_text1 = decodeURI(r.text1).toLowerCase();
} catch (ex) {
the_text1 = r.text1.toLowerCase();
}
try {
the_text2 = decodeURI(r.text2).toLowerCase();
} catch (ex) {
the_text2 = r.text2.toLowerCase();
}
emit({text1: the_text1, text2: the_text2,
similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});
}"
)
ORDER BY similarity DESC
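One thing to watch in the query above: the similarity is 1 - distance / length of text1, so it can drop below 0 when text2 is much longer than text1. A possible variant, sketched here in plain JavaScript under the assumption that a score in the 0..1 range is wanted, divides by the longer of the two lengths instead. `levenshtein` and `normalizedSimilarity` are illustrative names, not part of the original UDF.

```javascript
// Two-row Levenshtein distance (same recurrence as the UDF above, written compactly).
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  var prev = [];
  for (var j = 0; j <= b.length; j++) prev[j] = j;
  for (var i = 1; i <= a.length; i++) {
    var cur = [i];
    for (var k = 1; k <= b.length; k++) {
      var cost = a.charAt(i - 1) === b.charAt(k - 1) ? 0 : 1;
      cur[k] = Math.min(
        prev[k] + 1,        // deletion
        cur[k - 1] + 1,     // insertion
        prev[k - 1] + cost  // substitution
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// Assumption: normalize by the longer string so the score stays within [0, 1].
function normalizedSimilarity(a, b) {
  var longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are identical
  return 1 - levenshtein(a, b) / longest;
}

console.log(normalizedSimilarity('mikhail', 'mike')); // 1 - 4 / 7
```

With this normalization, comparing 'a' against 'abcdefgh' gives 1 - 7 / 8 = 0.125, where dividing by the length of the first string would give -6.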
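Answer 2 (score: 0)
"REGEXP_REPLACE second argument must be const and non-null. Is there any workaround?"
The following is just an idea/direction for addressing the above, applied to the logic you described:
"I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this: REGEXP_REPLACE('ga|an|na', 'ga|an|no', ''). Then based on the change in length I can calculate my coefficient."
The "workaround" is: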
SELECT a.w AS w1, b.w AS w2, SUM(a.x = b.x) / COUNT(1) AS c
FROM (
SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
FROM
(SELECT 'gana' AS w, 'ga|an|na' AS p)
) AS a
JOIN (
SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
FROM
(SELECT 'gano' AS w, 'ga|an|no' AS p),
(SELECT 'gamo' AS w, 'ga|am|mo' AS p),
(SELECT 'kana' AS w, 'ka|an|na' AS p)
) AS b
ON a.pos = b.pos
GROUP BY w1, w2
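To make the logic of the query above easier to follow, here is a rough JavaScript equivalent of what it computes per pair of words: the fraction of positions at which the two bigram lists hold the same bigram. It assumes ROW_NUMBER() yields positions in list order, which the query does not state explicitly; `positionalMatch` is a hypothetical name used only for this sketch.

```javascript
// Fraction of positions where both '|'-separated bigram lists agree.
function positionalMatch(list1, list2) {
  var a = list1.split('|');
  var b = list2.split('|');
  // The JOIN on pos keeps only positions present in both lists.
  var n = Math.min(a.length, b.length);
  var same = 0;
  for (var i = 0; i < n; i++) {
    if (a[i] === b[i]) same++;
  }
  return same / n;
}

console.log(positionalMatch('ga|an|na', 'ga|an|no')); // 2 of 3 positions match
console.log(positionalMatch('ga|an|na', 'ga|am|mo')); // 1 of 3 positions match
console.log(positionalMatch('ga|an|na', 'ka|an|na')); // 2 of 3 positions match
```

Note that this is a position-by-position comparison rather than Dice's coefficient proper: a bigram shared by both words but at different positions does not count.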
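"Maybe there is an even better way?"
Below is a simple example of how Pair Similarity can be approached here (including building the bigram sets and calculating the coefficient):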
SELECT
a.word AS word1, b.word AS word2,
2 * SUM(a.bigram = b.bigram) /
(EXACT_COUNT_DISTINCT(a.bigram) + EXACT_COUNT_DISTINCT(b.bigram) ) AS c
FROM (
SELECT word, char + next_char AS bigram
FROM (
SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
FROM (
SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
FROM
(SELECT 'gana' AS word)
)
)
WHERE next_char IS NOT NULL
GROUP BY 1, 2
) a
CROSS JOIN (
SELECT word, char + next_char AS bigram
FROM (
SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
FROM (
SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
FROM
(SELECT 'gano' AS word)
)
)
WHERE next_char IS NOT NULL
GROUP BY 1, 2
) b
GROUP BY 1, 2
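As a cross-check for the query above, the same calculation can be sketched in plain JavaScript: build the set of distinct bigrams for each word, count the bigrams common to both, and return 2 * shared / (size of set 1 + size of set 2). The names `bigramSet` and `diceCoefficient` are made up for this sketch, and it is meant for verifying results on a few sample pairs, not as a replacement for the SQL.

```javascript
// Distinct 2-grams of a word, e.g. 'gana' -> ['ga', 'an', 'na'].
function bigramSet(word) {
  var seen = {};
  for (var i = 0; i < word.length - 1; i++) {
    seen[word.substring(i, i + 2)] = true;
  }
  return Object.keys(seen);
}

// Dice's coefficient over the two sets of distinct bigrams.
function diceCoefficient(word1, word2) {
  var a = bigramSet(word1);
  var b = bigramSet(word2);
  if (a.length + b.length === 0) return 0; // both words shorter than 2 characters
  var shared = a.filter(function (gram) {
    return b.indexOf(gram) !== -1;
  }).length;
  return 2 * shared / (a.length + b.length);
}

console.log(diceCoefficient('gana', 'gano')); // 2 * 2 / (3 + 3), about 0.667
```

For 'gana' and 'gano' the sets are {ga, an, na} and {ga, an, no}, so two bigrams are shared and the coefficient is 4 / 6.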