Using regexp to extract URLs in Big Query

Time: 2016-10-22 05:57:53

Tags: sql regex google-bigquery

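I have been trying to extract any URLs from my 'Text' column in Big Query. The column contains a mix of plain text and URLs (a single cell may contain more than one URL). I tried this regular expression: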

SELECT

  REGEXP_EXTRACT  (Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*')
FROM
Data.Text_Files

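When I run the query, I currently get an error saying the regular expression could not be parsed. I have tried modifying it, to no avail.

The regex works in an online builder, but I am not sure how to carry it over to Big Query.

Any help is much appreciated, or at least a pointer on how to use regular expressions in Big Query!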

2 answers:

Answer 0: (score: 4)

Try the following. It works in BigQuery Standard SQL (see Enabling Standard SQL and Migrating from legacy SQL).

WITH YourTable AS (
  SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask).  ' AS Text UNION ALL
  SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
  SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
)
SELECT 
 id, 
 REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable

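This gives you output with the id field and a repeated field that holds all the URLs found in the corresponding row.

If you need the result flattened, you can use the following variation: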

WITH YourTable AS (
  SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask).  ' AS Text UNION ALL
  SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
  SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text 
)
SELECT
  id, URL    
FROM (
  SELECT id, REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
  FROM YourTable
), UNNEST(URL) as URL
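As a rough illustration of what the flattening does, here is a minimal Python sketch. It is not BigQuery code: the `rows` list is a hypothetical stand-in for the table, and `URL_RE` is a deliberately simple pattern chosen for the example, not the regex used in the queries above.

```python
import re

# Hypothetical stand-in for the table: (id, text) pairs.
rows = [
    (1, "Read http://stackoverflow.com/help/mcve and http://stackoverflow.com/help/how-to-ask first"),
    (2, "No links here"),
]

# Deliberately simple illustrative pattern (an assumption for this sketch).
URL_RE = re.compile(r"https?://\S+")

# Comparable to REGEXP_EXTRACT_ALL: one list of matches per input row.
nested = [(row_id, URL_RE.findall(text)) for row_id, text in rows]

# Comparable to the comma join with UNNEST: one output row per (id, url) pair.
# Rows whose list is empty contribute no output rows.
flat = [(row_id, url) for row_id, urls in nested for url in urls]

for row in flat:
    print(row)
```

The nested form keeps one entry per input row, while the flat form repeats the id once per URL, which is usually easier to filter or join on.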

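Note: you can use any regular expression you find on the web here, but there is one requirement: only one capturing group is allowed! So every inner group should be made non-capturing with ?:, as you can see in the example above. The one group you want to see in the output should be left as is, without ?: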
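One way to check this locally before pasting a pattern into a query is to count its capturing groups. The sketch below uses Python's `re` module, which is a different engine from the RE2 library BigQuery uses, so treat it only as a rough check. The helper name `capturing_group_count` and the two small patterns are made up for illustration.

```python
import re


def capturing_group_count(pattern: str) -> int:
    """Return the number of capturing groups in a pattern (hypothetical helper)."""
    return re.compile(pattern).groups


# Two capturing groups: more than the single group the note above allows.
two_groups = r"(https?)://(\w+)"
# The scheme group is made non-capturing with ?:, leaving one capturing group.
one_group = r"(?:https?)://(\w+)"

print(capturing_group_count(two_groups))
print(capturing_group_count(one_group))
```

Adding ?: changes only what is captured, not what is matched, so the rewritten pattern still matches the same text.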

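Answer 1: (score: 2)

Your regex has an unclosed capturing group and 2 unescaped characters. I don't know which online regex builder you are using, but perhaps you forgot to paste the new regex into it?

The problems are as follows: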

(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*
POINTERS TO PROBLEMS ON THIS LINE --->                             ^1                    ^^2
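  1. This is the start of a capturing group that is never closed. You probably want a ) before the *.
  2. All slashes need to be escaped. This should probably be \/, or maybe even \/\\
  3. Here is an example with both of my suggestions applied: https://regex101.com/r/pt1hqS/1

    Good luck!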
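The unclosed group can be reproduced outside BigQuery. The following is a minimal sketch using Python's `re` module, which is not the RE2 engine BigQuery uses, so it only approximates what the query does. The `fixed` pattern is one possible repair written for this example, not necessarily the regex behind the regex101 link, and `compiles` is a hypothetical helper.

```python
import re

# The pattern exactly as posted in the question: the last "(" is never closed.
broken = r"(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*"

# One possible repair (an assumption): close the last group, escape the slash,
# and make every group non-capturing so the whole match is returned.
fixed = r"(?:https?:\/\/.)?(?:www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9%_:?\+.~#&\/=]*)"


def compiles(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(broken))  # the unclosed "(" makes compilation fail
print(compiles(fixed))

sample = "see http://stackoverflow.com/help/mcve now"
found = re.findall(fixed, sample)
print(found)
```

In Python's `re`, the unclosed parenthesis alone is what stops the original pattern from compiling; unescaped slashes inside a character class are accepted there.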