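I have searched around but couldn't find much on this topic (possibly because of poor search terms :). I have a table with a field, protoPayload.resource, that receives Apache log information. The field I'm interested in therefore packs together several values that I need to search on. It is formatted like a PHP-style URL query string, i.e.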
/?id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN
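As a result, every search needs a really long regular expression to pull out the data, followed by join statements to combine the results.
Example search combining mac/win statistics: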
SELECT
  t1.date, t1.wincount, COALESCE(t2.maccount, 0) AS maccount
FROM (
  SELECT
    DATE(metadata.timestamp) AS date,
    INTEGER(COUNT(protoPayload.resource)) AS wincount
  FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP() ))
  WHERE
    (REGEXP_MATCH(protoPayload.resource, r'ver=[11,12'))
    AND protoPayload.resource CONTAINS 'os=win'
  GROUP BY date ) t1
LEFT JOIN (
  SELECT
    DATE(metadata.timestamp) AS date,
    INTEGER(COUNT(protoPayload.resource)) AS maccount
  FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP() ))
  WHERE
    (REGEXP_MATCH(protoPayload.resource, r'cv=[p,m][17,16,15,14]'))
    AND protoPayload.resource CONTAINS 'os=mac'
  GROUP BY date ) t2
ON
  t1.date = t2.date
ORDER BY t1.date
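What I'm thinking of doing is running a similar regex search, creating a new table, and saving the data into that new table with relational fields. I would then fix future logging so that records land in the table correctly.
My question: is this a valid solution, or is there a simpler way to accomplish this in Google BigQuery? Is there a better way to transform the data? Thanks again for any input!
Answer 0 (score: 3)
You can use a SQL function to parse the key-value pairs into an array, which is generally faster than using JavaScript. For example: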
#standardSQL
CREATE TEMPORARY FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
  )
);
SELECT ParseKeys('/?foo=bar&baz=2');
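One caveat with the function above: `entry[OFFSET(1)]` raises an error when a pair has no `=` (for example a bare `flag` key). A minimal sketch of a more tolerant variant follows, assuming a NULL value is acceptable for such keys. The name `ParseKeysSafe` and the sample input are made up for illustration.

```sql
#standardSQL
-- Hypothetical variant of ParseKeys: SAFE_OFFSET returns NULL instead of
-- raising an error when the requested array position does not exist.
CREATE TEMP FUNCTION ParseKeysSafe(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[SAFE_OFFSET(0)] AS key,
       entry[SAFE_OFFSET(1)] AS value))
   FROM (
     -- Split "k=v" into [k, v]; a bare "k" becomes the one-element array [k].
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
  )
);

-- 'flag' has no '=', so its value comes back as NULL rather than failing.
SELECT ParseKeysSafe('/?foo=bar&flag&baz=2');
```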
Now you can build on it with a function that turns the keys into struct fields:
#standardSQL
CREATE TEMP FUNCTION GetAttributes(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'id', CAST(value AS INT64), NULL)) AS id,
     MAX(IF(key = 'ver', CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_bits', CAST(value AS INT64), NULL)) AS os_bits,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type,
     MAX(IF(key = 'lng', value, NULL)) AS lng
   FROM UNNEST(ParseKeys(queryString)))
);
Putting it all together, you can try the GetAttributes function with some sample input:
#standardSQL
CREATE TEMPORARY FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
  )
);
CREATE TEMP FUNCTION GetAttributes(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'id', CAST(value AS INT64), NULL)) AS id,
     MAX(IF(key = 'ver', CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_bits', CAST(value AS INT64), NULL)) AS os_bits,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type,
     MAX(IF(key = 'lng', value, NULL)) AS lng
   FROM UNNEST(ParseKeys(queryString)))
);
SELECT url, GetAttributes(url).*
FROM UNNEST(['/?id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN',
             '/?id=2343645745&ver=15&os_bits=32&os_type=linux&lng=FR']) AS url;
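Once the attributes are available as typed fields, the per-day win/mac counts from the question no longer need a join or long regular expressions. The following is only a sketch under assumptions: the `logs` rows, the `ts` and `resource` column names, and the choice of `ver IN (11, 12)` as the filter are invented for illustration, and GetAttributes is trimmed to the two fields the counts need.

```sql
#standardSQL
CREATE TEMP FUNCTION ParseKeys(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  (SELECT
     ARRAY_AGG(STRUCT(
       entry[OFFSET(0)] AS key,
       entry[OFFSET(1)] AS value))
   FROM (
     SELECT SPLIT(pairString, '=') AS entry
     FROM UNNEST(SPLIT(REGEXP_EXTRACT(queryString, r'/\?(.*)'), '&')) AS pairString)
  )
);

-- Trimmed-down GetAttributes: only the fields used by the counts below.
CREATE TEMP FUNCTION GetAttributes(queryString STRING) AS (
  (SELECT AS STRUCT
     MAX(IF(key = 'ver', CAST(value AS INT64), NULL)) AS ver,
     MAX(IF(key = 'os_type', value, NULL)) AS os_type
   FROM UNNEST(ParseKeys(queryString)))
);

-- Hypothetical sample rows standing in for the real log table.
WITH logs AS (
  SELECT TIMESTAMP '2017-01-01 10:00:00' AS ts,
         '/?id=1&ver=12&os_bits=64&os_type=win&lng=EN' AS resource
  UNION ALL
  SELECT TIMESTAMP '2017-01-01 11:00:00',
         '/?id=2&ver=11&os_bits=64&os_type=mac&lng=EN'
  UNION ALL
  SELECT TIMESTAMP '2017-01-02 09:00:00',
         '/?id=3&ver=15&os_bits=32&os_type=win&lng=FR'
),
parsed AS (
  SELECT DATE(ts) AS day, GetAttributes(resource) AS attrs
  FROM logs
)
-- One pass over the data: conditional counts replace the LEFT JOIN.
SELECT
  day,
  COUNTIF(attrs.os_type = 'win' AND attrs.ver IN (11, 12)) AS wincount,
  COUNTIF(attrs.os_type = 'mac' AND attrs.ver IN (11, 12)) AS maccount
FROM parsed
GROUP BY day
ORDER BY day;
```

Comparing typed values (`attrs.ver IN (11, 12)`) rather than matching substrings also avoids accidental hits such as `ver=112`.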
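Answer 1 (score: 1)
You can always use a JavaScript UDF for maximum flexibility. JavaScript UDFs are slower than pure SQL solutions, but you can code around their limitations.
For example: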
#standardSQL
CREATE TEMPORARY FUNCTION parse(query STRING)
RETURNS STRUCT<id STRING, ver STRING, os_bits STRING, os_type STRING, lng STRING>
LANGUAGE js AS """
  function parseQueryString(query) {
    // http://codereview.stackexchange.com/a/10396
    var map = {};
    query.replace(/([^&=]+)=?([^&]*)(?:&+|$)/g, function(match, key, value) {
      (map[key] = map[key] || []).push(value);
    });
    return map;
  }
  return parseQueryString(query);
""";
WITH urls AS
  (SELECT 'id=13242134123&ver=12&os_bits=64&os_type=mac&lng=EN' query
   UNION ALL
   SELECT 'id=13242134124&ver=12&os_bits=64&os_type=mac&lng=EN1&lng=EN2' query
  )
SELECT query, parse(query) as parsed
FROM urls;
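Answer 2 (score: 0)
I see a few problems with the query in your question:
1. The regular expressions look incorrect and will not capture the results you expect.
2. The query is heavily over-engineered and can be simplified considerably.
The query below addresses both problems: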
SELECT
  DATE(metadata.timestamp) AS date,
  SUM(REGEXP_MATCH(protoPayload.resource, r'ver=(11|12)\b')
      AND protoPayload.resource CONTAINS 'os_type=win'
  ) AS wincount,
  SUM(REGEXP_MATCH(protoPayload.resource, r'cv=(p|m)(17|16|15|14)\b')
      AND protoPayload.resource CONTAINS 'os_type=mac'
  ) AS maccount
FROM (TABLE_DATE_RANGE(tablename, DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
                       CURRENT_TIMESTAMP() ))
GROUP BY date
Note: the query in your question is written in BigQuery Legacy SQL, so I kept my answer in the same dialect.
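For readers on standard SQL, a rough counterpart of the same single-pass idea might look like the sketch below. It is an illustration, not a drop-in replacement: the `logs` sample rows, the `ts` and `resource` column names, and the `cv=p17` value are made up, and against real date-sharded tables the `logs` CTE would presumably be replaced by a wildcard table filtered on `_TABLE_SUFFIX`.

```sql
#standardSQL
-- Hypothetical sample rows standing in for the real log table.
WITH logs AS (
  SELECT TIMESTAMP '2017-01-01 10:00:00' AS ts,
         '/?id=1&ver=12&os_bits=64&os_type=win&lng=EN' AS resource
  UNION ALL
  SELECT TIMESTAMP '2017-01-01 11:00:00',
         '/?id=2&cv=p17&os_bits=64&os_type=mac&lng=EN'
  UNION ALL
  SELECT TIMESTAMP '2017-01-02 09:00:00',
         '/?id=3&ver=15&os_bits=32&os_type=win&lng=FR'
)
SELECT
  DATE(ts) AS day,
  -- COUNTIF counts the rows where the condition is TRUE, playing the role
  -- of SUM over a boolean in the Legacy SQL version.
  COUNTIF(REGEXP_CONTAINS(resource, r'ver=(11|12)\b')
          AND STRPOS(resource, 'os_type=win') > 0) AS wincount,
  COUNTIF(REGEXP_CONTAINS(resource, r'cv=(p|m)(17|16|15|14)\b')
          AND STRPOS(resource, 'os_type=mac') > 0) AS maccount
FROM logs
GROUP BY day
ORDER BY day;
```

Here `REGEXP_CONTAINS` stands in for Legacy SQL's `REGEXP_MATCH`, and `STRPOS(...) > 0` for the `CONTAINS` operator.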