Finding duplicate rows based on a function rather than the raw content

Time: 2016-01-13 10:14:22

Tags: google-bigquery

I have a BigQuery table, logs, with two columns containing log messages:

time TIMESTAMP
message STRING

I want to select all messages that match the pattern job .+ got machine (\d+) and whose machine appears more than once. For example, given the rows:

10000, "job foo got machine 10"
10010, "job bar got machine 10"
10010, "job baz got machine 20"

the query would select the first two rows.

I can select the duplicated machines with this query:

SELECT
  REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
  [logs]
WHERE
  REGEXP_MATCH(message, r'job .+ got machine \d+')
GROUP BY
  machine_id
HAVING
  COUNT(message) > 1

But I can't figure out how to get from there to the rows that contain those machines. I tried the following:

SELECT
  [time],
  message,
  REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
  [logs]
WHERE
  REGEXP_MATCH(message, r'job .+ got machine \d+')
HAVING
  machine_id IN (
  SELECT
    REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
  FROM
    [logs]
  WHERE
    REGEXP_MATCH(message, r'job .+ got machine \d+')
  GROUP BY
    machine_id
  HAVING
    COUNT(message) > 1)

But this produces the error "Error: Field 'machine_id' not found".

Is it possible to do what I want in a single query?

3 answers:

Answer 0: (score: 1)

I was able to solve this with the following query:

SELECT
  [time],
  message
FROM (
  SELECT
    REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
  FROM
    [logs]
  WHERE
    REGEXP_MATCH(message, r'job .+ got machine \d+')
  GROUP BY
    machine_id
  HAVING
    COUNT(message) > 1) AS A
JOIN (
  SELECT
    [time],
    message,
    REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
  FROM
    [logs]
  WHERE
    REGEXP_MATCH(message, r'job .+ got machine \d+')) AS B
ON
  A.machine_id = B.machine_id

It feels a bit clunky, but it seems to do the job.

Answer 1: (score: 0)

Don't use HAVING in this case; just use WHERE:

SELECT
  [time],
  message,
  REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
FROM
  [logs]
WHERE
  REGEXP_MATCH(message, r'job .+ got machine \d+')
  AND REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') IN (
  SELECT
    REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
  FROM
    [logs]
  WHERE
    REGEXP_MATCH(message, r'job .+ got machine \d+')
  GROUP BY
    machine_id
  HAVING
    COUNT(message) > 1)
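
For anyone who wants to sanity-check the expected result outside BigQuery, the two-step logic behind this query (find the machine ids that occur more than once, then keep the rows that carry one of them) can be sketched in plain Python. This is only an illustration on the three sample rows from the question; the `rows` list and the `machine_id` helper are made-up names for the sketch, not anything BigQuery provides.

```python
import re
from collections import Counter

PATTERN = re.compile(r"job .+ got machine (\d+)")

# The three sample rows from the question: (time, message).
rows = [
    (10000, "job foo got machine 10"),
    (10010, "job bar got machine 10"),
    (10010, "job baz got machine 20"),
]

def machine_id(message):
    """Return the captured machine id, or None when the message does not match."""
    match = PATTERN.search(message)
    return match.group(1) if match else None

# Step 1: count matching rows per machine id (the GROUP BY ... HAVING part).
counts = Counter()
for _, message in rows:
    mid = machine_id(message)
    if mid is not None:
        counts[mid] += 1
duplicated = {mid for mid, n in counts.items() if n > 1}

# Step 2: keep the rows whose machine id is in that set (the WHERE ... IN part).
selected = [(t, message) for t, message in rows if machine_id(message) in duplicated]

print(selected)
```

On the sample data only machine 10 is duplicated, so `selected` holds the first two rows, which is what the question asks for.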

Answer 2: (score: 0)

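Try the following. The inner WHERE keeps rows that do not match the pattern out of the count; without it they would all share a NULL machine_id and be counted as duplicates of each other:

SELECT [time], message
FROM (
  SELECT [time], message, machine_id,
    COUNT(1) OVER(PARTITION BY machine_id) AS dups
  FROM (
    SELECT [time], message,
      REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id
    FROM
      [logs]
    WHERE
      REGEXP_MATCH(message, r'job .+ got machine \d+')
  )
)
WHERE dups > 1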

No join, and less clunky.

Or simplified even further:

SELECT [time], message FROM (
  SELECT [time], message, 
    REGEXP_EXTRACT(message, r'job .+ got machine (\d+)') machine_id,
    COUNT(1) OVER(PARTITION BY machine_id) AS dups
  FROM
    [logs]
  WHERE
    REGEXP_MATCH(message, r'job .+ got machine \d+')
)
WHERE dups > 1
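
One thing worth checking with a window-based query like this is what happens to log rows that do not match the pattern at all. REGEXP_EXTRACT gives NULL for them, and in SQL engines NULL keys normally end up in a single partition, so filtering on the pattern before counting matters. The sketch below is not BigQuery: it uses Python's built-in sqlite3 module as a stand-in, registers a hypothetical `REGEXP_EXTRACT` helper to mimic the BigQuery function, and adds two made-up "disk check" rows to the sample data. It assumes an SQLite build with window function support (3.25 or later).

```python
import re
import sqlite3

PATTERN = r"job .+ got machine (\d+)"

def regexp_extract(text, pattern):
    """Stand-in for BigQuery's REGEXP_EXTRACT: first capture group, or None (NULL)."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP_EXTRACT", 2, regexp_extract)
conn.execute("CREATE TABLE logs (time INTEGER, message TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [
        (10000, "job foo got machine 10"),
        (10010, "job bar got machine 10"),
        (10010, "job baz got machine 20"),
        (10020, "disk check started"),   # made-up rows that do not match
        (10030, "disk check finished"),
    ],
)

QUERY = """
SELECT time, message FROM (
  SELECT time, message,
    COUNT(1) OVER (PARTITION BY REGEXP_EXTRACT(message, :pattern)) AS dups
  FROM logs
  {where}
)
WHERE dups > 1
ORDER BY time, message
"""

def run(where_clause):
    """Run the window query with an optional WHERE clause in the inner SELECT."""
    sql = QUERY.format(where=where_clause)
    return conn.execute(sql, {"pattern": PATTERN}).fetchall()

unfiltered = run("")
filtered = run("WHERE REGEXP_EXTRACT(message, :pattern) IS NOT NULL")

print(unfiltered)
print(filtered)
```

In this stand-in, the unfiltered form also returns the two "disk check" rows, because they share the NULL partition and so have a count of 2. The filtered form returns only the two rows for machine 10.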