BigQuery %like% grouping

Time: 2016-06-08 17:32:27

Tags: google-bigquery

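I have a table containing a list of product names, and I need a count for each product. Some product names are written with different casing, for example the "Juice" product appears as Juice, juice, and so on. I need to group these variants together and show the counts using BigQuery.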

Juice - 100
juice - 14
Milk - 10
milk - 3
MIL - 1

The table above should end up looking like this:

Juice - 114
Milk - 14

2 answers:

Answer 0: (score: 1)

If there are no misspelled words to account for, the solution is as simple as this:

SELECT LOWER(word) AS word, SUM(cnt) AS cnt 
FROM YourTable
GROUP BY 1
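
As an illustration of what this query does, here is a minimal JavaScript sketch of the same case-insensitive aggregation. The `sampleRows` array and the `groupByLowerCase` helper are hypothetical names made up for this example; they only mimic the question's data and are not part of BigQuery.

```javascript
// Hypothetical sample rows mirroring the question's data (not a real table).
const sampleRows = [
  { word: "Juice", cnt: 100 },
  { word: "juice", cnt: 14 },
  { word: "Milk", cnt: 10 },
  { word: "milk", cnt: 3 },
];

// Rough equivalent of: SELECT LOWER(word), SUM(cnt) ... GROUP BY 1
function groupByLowerCase(input) {
  const totals = new Map();
  for (const { word, cnt } of input) {
    const key = word.toLowerCase();
    totals.set(key, (totals.get(key) || 0) + cnt);
  }
  return totals;
}

const caseTotals = groupByLowerCase(sampleRows);
console.log(Object.fromEntries(caseTotals)); // { juice: 114, milk: 13 }
```

Lower-casing only merges variants that differ in case; a misspelling such as "mil" would still end up in its own group.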

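But in your case you first need to deal with similarity. Below is one option to consider.

First, let's go over the high-level logic/steps.



Step 0 - assume your table (YourTable) looks like this: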

SELECT
  word, cnt
FROM
  (SELECT 'Juice' AS word, 100 AS cnt),
  (SELECT 'juice' AS word, 14 AS cnt),
  (SELECT 'Milk' AS word, 10 AS cnt),
  (SELECT 'milk' AS word, 3 AS cnt),
  (SELECT 'milkk' AS word, 1 AS cnt),
  (SELECT 'mil' AS word, 1 AS cnt)
  

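Step 1 - calculate similarity

Similarity here is 1 minus the Levenshtein distance divided by the length of the word being replaced (see Query 1). Let's consider only the pairs whose similarity is greater than 0.5 and less than 1. The expected result would then look like this: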

word    replacement similarity   
milkk   milk        0.8  
mil     milk        0.6666666666666667   
milkk   mil         0.6 
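
To make the numbers in this table easier to check, here is a small standalone JavaScript sketch of the similarity measure that Query 1 computes: 1 minus the Levenshtein distance divided by the length of the first word. The function names `levenshtein` and `similarity` are chosen for this illustration, and the distance routine is a plain two-row implementation rather than the exact code from the query.

```javascript
// Levenshtein distance using two rows of the dynamic-programming table.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(
        prev[j - 1] + cost, // substitution
        cur[j - 1] + 1,     // insertion
        prev[j] + 1         // deletion
      );
    }
    prev = cur;
  }
  return prev[b.length];
}

// similarity = 1 - distance / length of the word being replaced
function similarity(word, replacement) {
  return 1 - levenshtein(word, replacement) / word.length;
}

const candidatePairs = [["milkk", "milk"], ["mil", "milk"], ["milkk", "mil"]];
for (const [word, replacement] of candidatePairs) {
  console.log(word, replacement, similarity(word, replacement));
}
```

The three printed values should line up with the similarity column above. Because the division uses the length of the first word, the measure is not symmetric: swapping the two arguments can give a different score.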
  

Step 2 - pick the winners

You would expect:

word    replacement  
milkk   milk     
mil     milk    
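
The selection rule can be sketched in JavaScript as follows, assuming the same rule as Query 2: each replacement is weighted by how many candidate rows point at it, and every word keeps its heaviest replacement. The `candidates` array simply restates the Step 1 table, and `pickWinners` is a hypothetical helper name for this example.

```javascript
// Candidate rows as listed in the Step 1 table.
const candidates = [
  { word: "milkk", replacement: "milk", similarity: 0.8 },
  { word: "mil", replacement: "milk", similarity: 0.6666666666666667 },
  { word: "milkk", replacement: "mil", similarity: 0.6 },
];

function pickWinners(rows) {
  // weight = number of candidate rows that point at each replacement
  const weight = new Map();
  for (const { replacement } of rows) {
    weight.set(replacement, (weight.get(replacement) || 0) + 1);
  }

  // For each word, keep the replacement with the highest weight.
  // Assumption: on a tie, the first candidate seen wins.
  const winners = new Map();
  for (const { word, replacement } of rows) {
    const best = winners.get(word);
    if (best === undefined || weight.get(replacement) > weight.get(best)) {
      winners.set(word, replacement);
    }
  }
  return winners;
}

const winners = pickWinners(candidates);
console.log(Object.fromEntries(winners)); // { milkk: 'milk', mil: 'milk' }
```

Here "milk" has weight 2 and "mil" has weight 1, so "milkk" maps to "milk" rather than "mil". The tie-breaking rule is an assumption of this sketch; the SQL version leaves the order of tied rows unspecified.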
  

Step 3 - final aggregation

word    cnt  
juice   114  
milk    15  
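
As a cross-check of these totals, the following JavaScript sketch applies the Step 2 replacements to the Step 0 rows and sums the counts. The names `stepZeroRows`, `replacementFor` and `aggregate` are made up for this illustration; the logic is meant to mirror the `IFNULL(y.replacement, x.word)` expression in Query 3, not to replace it.

```javascript
// Rows from Step 0.
const stepZeroRows = [
  { word: "Juice", cnt: 100 },
  { word: "juice", cnt: 14 },
  { word: "Milk", cnt: 10 },
  { word: "milk", cnt: 3 },
  { word: "milkk", cnt: 1 },
  { word: "mil", cnt: 1 },
];

// Winners from Step 2: misspelled word -> replacement.
const replacementFor = new Map([
  ["milkk", "milk"],
  ["mil", "milk"],
]);

function aggregate(rows, replacements) {
  const totals = new Map();
  for (const { word, cnt } of rows) {
    const lower = word.toLowerCase();
    // Use the replacement when one exists, otherwise the lower-cased word itself.
    const key = replacements.has(lower) ? replacements.get(lower) : lower;
    totals.set(key, (totals.get(key) || 0) + cnt);
  }
  return totals;
}

const finalTotals = aggregate(stepZeroRows, replacementFor);
console.log(Object.fromEntries(finalTotals)); // { juice: 114, milk: 15 }
```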
  

Below is the code for each step.

It can most likely be optimized, improved and combined, but it is meant to give you an idea (and working code).

  

Query 1 (Step 1) - replacement candidates

Let's write the output to a table -> Replacements

SELECT text1 AS word, text2 AS replacement, similarity FROM 
JS(
// input table
(
  SELECT 
    word1 AS text1, 
    word2 AS text2
  FROM (
    SELECT
      CASE WHEN a.cnt < b.cnt THEN a.word ELSE b.word END AS word1,
      CASE WHEN a.cnt < b.cnt THEN b.word ELSE a.word END AS word2
    FROM (
      SELECT LOWER(word) AS word, SUM(cnt) AS cnt 
      FROM YourTable
      GROUP BY 1
    ) AS a
    CROSS JOIN (
      SELECT LOWER(word) AS word, SUM(cnt) AS cnt 
      FROM YourTable
      GROUP BY 1
    ) AS b
    WHERE a.word <= b.word 
  )
) ,
// input columns
text1, text2,
// output schema
"[{name: 'text1', type:'string'},
  {name: 'text2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_text1, the_text2;

  try {
    the_text1 = decodeURI(r.text1).toLowerCase();
  } catch (ex) {
    the_text1 = r.text1.toLowerCase();
  }

  try {
    the_text2 = decodeURI(r.text2).toLowerCase();
  } catch (ex) {
    the_text2 = r.text2.toLowerCase();
  }

  emit({text1: the_text1, text2: the_text2,
        similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});

  }"
)
WHERE similarity > 0.5 AND similarity < 1
ORDER BY similarity DESC
  

Query 2 (Step 2) - replacement winners

SELECT word, replacement FROM (
  SELECT 
    a.word AS word, a.replacement AS replacement, b.replacement, b.weight,
    ROW_NUMBER() OVER(PARTITION BY a.word ORDER BY b.weight DESC) AS win
  FROM (
    SELECT word, replacement
    FROM Replacements
  ) a
  JOIN (
    SELECT replacement, COUNT(1) AS weight
    FROM Replacements
    GROUP BY replacement 
  ) b
  ON a.replacement = b.replacement
)
WHERE win = 1 
  

Query 3 (Steps 2 and 3 combined) - replacements and final aggregation

SELECT 
  IFNULL(y.replacement, x.word) AS word,
  SUM(cnt) AS cnt
FROM (
  SELECT LOWER(word) AS word, SUM(cnt) AS cnt 
  FROM YourTable
  GROUP BY 1
) x
LEFT JOIN (
  SELECT word, replacement 
  FROM (
    SELECT 
      a.word AS word, a.replacement AS replacement, b.replacement, b.weight,
      ROW_NUMBER() OVER(PARTITION BY a.word ORDER BY b.weight DESC) AS win
    FROM (
      SELECT word, replacement
      FROM Replacements
    ) a
    JOIN (
      SELECT replacement, COUNT(1) AS weight
      FROM Replacements
      GROUP BY replacement 
    ) b
    ON a.replacement = b.replacement
  )
  WHERE win = 1
) y
ON x.word = y.word
GROUP BY word

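Even though the above works, and you can run it against the example, I cannot guarantee that it will behave exactly as you expect on your real data. Still, I hope it gives you a good direction to explore.

Answer 1: (score: 0)

Does this work for you?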

  public FileResult Download(int id)
    {
        string contentType = "";
        var arquivos = db.Anexos.ToList();
        string nomeArquivo = (from arquivo in arquivos
                              where arquivo.AnexoId == id
                              select arquivo.Caminho).First();
        string extensao = Path.GetExtension(nomeArquivo);
        string nomeArquivoV = Path.GetFileNameWithoutExtension(nomeArquivo);
        System.Diagnostics.Debug.WriteLine("~/Anexos/" + nomeArquivoV + extensao);
        if (extensao.Equals(".zip"))
            contentType = "application/zip";
        return File(nomeArquivo, contentType,"~/Anexos/" + nomeArquivoV + extensao);
    }