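I have been trying to extract any URLs from my "Text" column in BigQuery. The column contains a mix of plain text and URLs (a single cell may contain several URLs). I tried using this regular expression: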
SELECT
REGEXP_EXTRACT (Text, r'(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*')
FROM
Data.Text_Files
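When I try to run the query, I currently get a "cannot parse regular expression" error. I have tried modifying the pattern, but to no avail.
The regex works in an online regex builder, but I am not sure how to incorporate it into BigQuery.
Any help is much appreciated, or at least a pointer on how to incorporate regular expressions into BigQuery!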
Answer 0 (score: 4)
Try the following, which is for BigQuery Standard SQL (see Enabling Standard SQL and Migrating from legacy SQL):
WITH YourTable AS (
SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL
SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text
)
SELECT
id,
REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable
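This gives you output with the id field plus a repeated field containing all the corresponding URLs.
If you need flattened results, you can use the following variation: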
WITH YourTable AS (
SELECT 1 AS id, 'What have you tried so far? Please edit your question to show a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) of the code that you are having problems with, then we can try to help with the specific problem. You can also read [How to Ask](http://stackoverflow.com/help/how-to-ask). ' AS Text UNION ALL
SELECT 2 AS id, 'Important on SO, you can mark accepted answer by using the tick on the left of the posted answer, below the voting. see http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235 for why it is important. There are more ... You can check about what to do when someone answers your question - http://stackoverflow.com/help/someone-answers.' AS Text UNION ALL
SELECT 3 AS id, 'If an answer has helped you solve your problem and you accept it you should also consider voting it up. See more at http://stackoverflow.com/help/someone-answers and Upvote section in http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work#5235' AS Text
)
SELECT
id, URL
FROM (
SELECT id, REGEXP_EXTRACT_ALL(Text, r'(?i:(?:(?:(?:ftp|https?):\/\/)(?:www\.)?|www\.)(?:[\da-z-_\.]+)(?:[a-z\.]{2,7})(?:[\/\w\.-_\?\&]*)*\/?)') AS URL
FROM YourTable
), UNNEST(URL) as URL
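Note: you can use any regex you find on the web here, but there is one requirement: only one capturing group is allowed! So all inner groups should be made non-capturing with ?:, as you can see in the example above. The only group you want to see in the output should be left as is, without ?:.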
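To make the capturing-group rule concrete, here is a minimal sketch in BigQuery Standard SQL. The Sample table, its single row, and the simplified patterns are made up for illustration and are not part of the answer above; the patterns are deliberately much cruder than the URL regex in the answer.

```sql
-- Hypothetical one-row input, used only to illustrate the rule.
WITH Sample AS (
  SELECT 'see http://example.com/a and https://www.example.org/b' AS Text
)
SELECT
  -- No capturing group: each array element is the whole match,
  -- here ['http://example.com', 'https://www.example.org'].
  REGEXP_EXTRACT_ALL(Text, r'(?:https?://)(?:www\.)?[a-z0-9.-]+') AS whole_match,
  -- Exactly one capturing group: each element is only that group,
  -- here ['example.com', 'example.org'].
  REGEXP_EXTRACT_ALL(Text, r'(?:https?://)(?:www\.)?([a-z0-9.-]+)') AS host_only
FROM Sample
```

If the second pattern used plain (...) for the scheme or the www. part as well, it would contain more than one capturing group and BigQuery would reject the query instead of returning rows.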
Answer 1 (score: 2)
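Your regex has an incomplete capturing group and two unescaped characters. I don't know which online regex builder you are using, but perhaps you forgot to paste the new regex into it?
The problems are as follows: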
(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9%_:?\+.~#&//=]*
POINTERS TO PROBLEMS ON THIS LINE --->                             ^1                    ^^2
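1. The last capturing group is never closed: it needs a ) before the *.
2. The two / characters should be escaped as \/, or possibly even \/\\.
Here is an example with both of my suggestions implemented: https://regex101.com/r/pt1hqS/1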
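For reference, a rough sketch of how the question's query might look with both suggestions applied in BigQuery Standard SQL. This is an assumption on my part, not the pattern behind the regex101 link: besides closing the last group and escaping the slash, it turns every group into a non-capturing (?:...) group, since BigQuery accepts at most one capturing group, and it switches to REGEXP_EXTRACT_ALL so that a cell with several URLs returns all of them.

```sql
SELECT
  REGEXP_EXTRACT_ALL(
    Text,
    -- Same structure as the pattern in the question, with:
    --   * every group made non-capturing via ?:
    --   * the last group closed before the trailing *
    --   * the slash inside the last character class escaped
    r'(?:https?:\/\/.)?(?:www\.)?[-a-zA-Z0-9:%._\+~#=]{2,256}\.[a-z]{2,6}\b(?:[-a-zA-Z0-9%_:?\+.~#&\/=])*'
  ) AS urls
FROM
  Data.Text_Files
```

Because the scheme and the www. prefix are both optional, this pattern also matches plain dotted tokens such as file names, so the results may need filtering afterwards.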
Good luck!