Splitting fields in BigQuery

Date: 2017-01-12 18:17:13

Tags: google-bigquery

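I've searched around but can't find much on this topic (possibly my search terms are poor :). I have a table with a field, protoPayload.resource, that holds Apache log information. The field I'm interested in therefore contains multiple values that I need to search. It is formatted like a PHP URL query string, i.e.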

/?id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN

This means every search needs a really long regular expression to get at the data, followed by join statements to combine the results.

Example of a search that combines mac/win statistics:

SELECT
  t1.date, t1.wincount, COALESCE(t2.maccount, 0) AS maccount
FROM (
  SELECT
    DATE(metadata.timestamp) AS date,
    INTEGER(COUNT(protoPayload.resource)) AS wincount
  FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP() ))
  WHERE
    (REGEXP_MATCH(protoPayload.resource, r'ver=[11,12'))
    AND protoPayload.resource CONTAINS 'os=win' GROUP BY date ) t1
LEFT JOIN (
  SELECT
    DATE(metadata.timestamp) AS date,
    INTEGER(COUNT(protoPayload.resource)) AS maccount
  FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP() ))
  WHERE
    (REGEXP_MATCH(protoPayload.resource, r'cv=[p,m][17,16,15,14]'))
    AND protoPayload.resource CONTAINS 'os=mac' GROUP BY date ) t2
ON
  t1.date = t2.date
ORDER BY t1.date

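What I'm thinking of doing is running similar regex searches, creating a new table, and saving the data to that new table with relational fields. Then I would fix future logging so that it records to the table correctly.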

My question: is this a valid solution, or is there an easier way to achieve this in Google BigQuery? Is there a better way to transform the data? Thanks again for any input!

3 answers:

Answer 0 (score: 3)

You can use a SQL function to parse the key-value pairs into an array, which is usually faster than using JavaScript. For example,

#standardSQL
CREATE TEMPORARY FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
   )
);
SELECT ParseKeys('/?foo=bar&baz=2');

Now you can build on it with a function that converts the keys into struct fields:

#standardSQL
CREATE TEMP FUNCTION GetAttributes(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'id', CAST(value AS INT64), NULL)) AS id,
     MAX(IF(key = 'ver', CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_bits', CAST(value AS INT64), NULL)) AS os_bits,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type,
     MAX(IF(key = 'lng', value, NULL)) AS lng
   FROM UNNEST(ParseKeys(queryString)))
);

Putting it all together, you can try the GetAttributes function on some sample input:

#standardSQL
CREATE TEMPORARY FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
   )
);
CREATE TEMP FUNCTION GetAttributes(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'id', CAST(value AS INT64), NULL)) AS id,
     MAX(IF(key = 'ver', CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_bits', CAST(value AS INT64), NULL)) AS os_bits,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type,
     MAX(IF(key = 'lng', value, NULL)) AS lng
   FROM UNNEST(ParseKeys(queryString)))
);
SELECT url, GetAttributes(url).*
FROM UNNEST(['/?id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN',
             '/?id=2343645745&ver=15&os_bits=32&os_type=linux&lng=FR']) AS url;
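
To make the data flow of the two SQL functions easier to follow outside BigQuery, the same steps can be mirrored in plain JavaScript. This is only a sketch for inspection, not something to run inside BigQuery. The names parseKeys and getAttributes are illustrative, and two details are simplifying assumptions: a repeated key keeps its last value (the SQL version takes MAX), and a non-numeric value becomes NaN instead of raising a cast error.

```javascript
// Plain-JavaScript mirror of the ParseKeys / GetAttributes steps above.
// parseKeys and getAttributes are illustrative names, not BigQuery APIs.

// Step 1: take everything after "/?" and split it into {key, value} pairs.
function parseKeys(url) {
  const match = /\/\?(.*)/.exec(url);
  if (match === null) return [];
  return match[1].split('&').map((pair) => {
    const [key, value] = pair.split('=');
    return { key, value };
  });
}

// Step 2: pick the known keys and convert the numeric ones.
function getAttributes(url) {
  const lookup = {};
  for (const { key, value } of parseKeys(url)) {
    lookup[key] = value; // assumption: last value wins for repeated keys
  }
  const orNull = (v) => (v === undefined ? null : v);
  const toInt = (v) => (v === undefined ? null : Number(v));
  return {
    id: toInt(lookup.id),
    ver: toInt(lookup.ver),
    os_bits: toInt(lookup.os_bits),
    os_type: orNull(lookup.os_type),
    lng: orNull(lookup.lng),
  };
}

console.log(getAttributes('/?id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN'));
```

Keys that are absent from the URL come back as null, which corresponds to the NULL that the MAX(IF(...)) expressions produce when no pair matches.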

Answer 1 (score: 1)

You can always use a JavaScript UDF for maximum flexibility. JavaScript UDFs are slower than pure SQL solutions, but you can code around their limitations.

For example:

#standardSQL
CREATE TEMPORARY FUNCTION parse(query STRING)
RETURNS STRUCT<id STRING, ver STRING, os_bits STRING, os_type STRING, lng STRING>
LANGUAGE js AS """
  function parseQueryString(query) {
      // http://codereview.stackexchange.com/a/10396
      var  map   = {};
      query.replace(/([^&=]+)=?([^&]*)(?:&+|$)/g, function(match, key, value) {
          (map[key] = map[key] || []).push(value);
      });
      return map;
  }

  return parseQueryString(query)  

""";


WITH urls AS
  (SELECT 'id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN' query
   UNION ALL
   SELECT 'id=13242134124&ver=12&os_bits=64&os_type=mac&lng=EN1&lng=EN2' query
)


SELECT query, parse(query) as parsed
FROM urls;
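
To see what the helper inside the UDF returns, it can be run on its own in any JavaScript runtime. The sketch below reuses the same regular expression and the two sample query strings from the query above. It only shows the JavaScript side: each key maps to an array of values, so a repeated key such as lng keeps every value. How BigQuery converts those arrays into the declared STRING fields is outside the scope of this sketch.

```javascript
// The same helper as in the UDF body, run outside BigQuery to inspect its output.
function parseQueryString(query) {
  const map = {};
  // Each match is one key=value pair; values for a repeated key are appended.
  query.replace(/([^&=]+)=?([^&]*)(?:&+|$)/g, (match, key, value) => {
    (map[key] = map[key] || []).push(value);
  });
  return map;
}

const single = parseQueryString(
  'id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN');
const repeated = parseQueryString(
  'id=13242134124&ver=12&os_bits=64&os_type=mac&lng=EN1&lng=EN2');

console.log(single.lng);   // one-element array containing 'EN'
console.log(repeated.lng); // two-element array containing 'EN1' and 'EN2'
```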


Answer 2 (score: 0)

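I see a few problems with the query in your question:
1. The regular expressions look incorrect and will not capture the results you expect.
2. The query is heavily over-engineered and can be simplified considerably.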

The query below addresses both problems:

SELECT
  DATE(metadata.timestamp) AS date,
  SUM(REGEXP_MATCH(protoPayload.resource, r'ver=(11|12)\b') 
      AND protoPayload.resource CONTAINS 'os_type=win'
  ) AS wincount,
  SUM(REGEXP_MATCH(protoPayload.resource, r'cv=(p|m)(17|16|15|14)\b') 
      AND protoPayload.resource CONTAINS 'os_type=mac'
  ) AS maccount
FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
                                  CURRENT_TIMESTAMP() ))
GROUP BY date

Note: the query in your question is written in BigQuery Legacy SQL, so I kept my answer in the same dialect.
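
The regex problem in point 1 comes from square brackets being character classes: each bracket expression matches exactly one character from the listed set, so [17,16,15,14] means "one of 1, 7, 6, 5, 4 or a comma", not "one of these four numbers". The sketch below illustrates this with JavaScript's regex engine. BigQuery uses a different engine (RE2), so this is an illustration under the assumption that character classes, alternation and \b behave the same way for these particular patterns.

```javascript
// Bracket expressions are character classes: each one matches a single character.
const original = /cv=[p,m][17,16,15,14]/;
// Alternation in groups expresses "one of these whole tokens".
const corrected = /cv=(p|m)(17|16|15|14)\b/;

console.log(original.test('cv=p17'));   // true
console.log(original.test('cv=p19'));   // true: only "cv=p1" is needed to match
console.log(original.test('cv=,4'));    // true: the comma is part of both classes

console.log(corrected.test('cv=p17&os_type=mac')); // true
console.log(corrected.test('cv=p19'));  // false
console.log(corrected.test('cv=p171')); // false: \b requires the number to end

// The first pattern in the question, 'ver=[11,12', never closes its bracket,
// so it is not a valid pattern at all.
let compiled = true;
try {
  new RegExp('ver=[11,12');
} catch (e) {
  compiled = false;
}
console.log(compiled); // false
```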