Question

I know there are plenty of questions about lazy regex matching, but none of the solutions I've seen have worked. Here is the problem:

Some addresses in my BigQuery results look like this:

www.example.comwww.example.com/path/to/page
apply.example.comapply.example.com/eapp/

I want to strip the duplicated part to get:

www.example.com/path/to/page
apply.example.com/eapp/

I tried using REGEXP_REPLACE() like this:

REGEXP_REPLACE(raw.query.tidy_Landing, r'(.*?)\.com','') AS Landing_Page

But this still finds both matches, so it returns:

/path/to/page
/eapp/

What is wrong with my regex?

Answer 1

Anchor the pattern to the start of the string so that only the leading copy is removed:

#standardSQL
WITH t AS (
  SELECT 'www.example.comwww.example.com/path/to/page' str UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
)
SELECT str, REGEXP_REPLACE(str, r'^(.*?\.com)', '') fix
FROM t
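
To make the difference visible, here is a minimal sketch that runs the question's pattern and the anchored pattern side by side. It relies on the documented behavior that REGEXP_REPLACE replaces every non-overlapping match, so the lazy quantifier only limits how much each match consumes, not how many matches there are. The column names `unanchored` and `anchored` are made up for illustration.

```sql
#standardSQL
WITH t AS (
  SELECT 'www.example.comwww.example.com/path/to/page' AS str UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
)
SELECT
  str,
  -- No anchor: the pattern matches twice per row, and both matches are removed.
  REGEXP_REPLACE(str, r'(.*?)\.com', '') AS unanchored,
  -- ^ pins the match to the start of the string, so only the first copy is removed.
  REGEXP_REPLACE(str, r'^(.*?\.com)', '') AS anchored
FROM t
```

With the sample rows, `unanchored` should reproduce the truncated paths from the question, while `anchored` should keep one full copy of the address.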

Answer 2

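I wanted to see whether this could be done without a regular expression, although it ends up a bit verbose :) This answer assumes that the address is always duplicated and that its domain ends in .com. If that is the case, you can use SPLIT to extract the parts you are interested in: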

SELECT
  CONCAT(
    SPLIT(text, '.com')[OFFSET(0)],
    '.com',
    SPLIT(text, '.com')[OFFSET(2)]
  ) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
);
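
To see why offsets 0 and 2 are the right ones, it may help to look at the array that SPLIT produces. The following is only an illustrative sketch; the column names `parts` and `n_parts` are not part of the original answer.

```sql
#standardSQL
SELECT
  text,
  -- A duplicated address yields three elements:
  -- [domain without '.com', domain without '.com', path]
  SPLIT(text, '.com') AS parts,
  ARRAY_LENGTH(SPLIT(text, '.com')) AS n_parts
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/'
);
```

Element 0 is the domain minus its '.com' suffix and element 2 is the path, which is why the query re-inserts '.com' between them.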

If you want the query to tolerate non-duplicated addresses, you can modify it slightly:

SELECT
  (
    SELECT 
      CONCAT(
        parts[OFFSET(0)],
        '.com',
        parts[OFFSET(ARRAY_LENGTH(parts) - 1)]
      )
    FROM (SELECT SPLIT(text, '.com') AS parts)
  ) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/' UNION ALL
  SELECT 'www.example.com/path/to/page'
);

Going a step further, you can extract the logic into a UDF:

CREATE TEMP FUNCTION GetLandingPage(text STRING) AS (
  (
    SELECT 
      CONCAT(
        parts[OFFSET(0)],
        '.com',
        parts[OFFSET(ARRAY_LENGTH(parts) - 1)]
      )
    FROM (SELECT SPLIT(text, '.com') AS parts)
  )
);

SELECT
  GetLandingPage(text) AS Landing_Page
FROM (
  SELECT 'www.example.comwww.example.com/path/to/page' AS text UNION ALL
  SELECT 'apply.example.comapply.example.com/eapp/' UNION ALL
  SELECT 'www.example.com/path/to/page'
);

BigQuery REGEXP_REPLACE first instance of repeated substring
