Question

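I need help writing a regular expression to extract all website addresses from a log file. Each line of the log file contains a bunch of information (IP address, protocol, bytes, requested website, etc.).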

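Specifically, I want to pull out anything that begins with "http://" and ends with a specific ".ENDING", where I specify ENDING = com, biz, net, tv, info. I don't care about the full URL (i.e. for http://www.google.com/bla/page2=blablabla I only want http://www.google.com). The harder part of this regex is that I want it to pick up domains that contain .com, .info or .biz as a subdomain (i.e. http://www.google.com.MaliciousWebsite.com). Is there any way to do that? In that case, I want to capture the full domain name rather than chopping it off at google.com.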

I have never written a regular expression before, so I tried using an online reference chart (http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/), but I'm struggling. Here is what I have so far:

"\A[http://]\Z[\.][com,info,biz,tv,net]"

*Sorry about the spacing in the URLs, but Stack Overflow was flagging them; since I'm new, I can only post a maximum of 2.

Thanks for your help.

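Update: Based on everyone's great feedback so far, I think it would be best to write this rule so that it takes in everything between (http or https) and (a character that is invalid in a URL: ?, !, @, #, $, %, ^, &, *, (, ), [, {, }, ], |, /, ', ", ;, <, >).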

This would make sure that all top-level domains are caught, along with sites like google.com.bad.website.com. Here is my mockup so far:

"\A[https?://]'?!(!@#$%^&*()-=[]{}|\'";,<>)"

Thanks again for your help.

Answer 1

I'm not sure which regex flavor you are using, so I'll use .NET syntax. How about:

@"^https?://[^?/#\s\r]+"

This isn't perfect, but the real spec for domain names is a beast, and the presence of http:// or https:// should be enough to tell you that a domain name is on the way.

The ? and # inside the character class should be fine, but I haven't had a chance to check. You may need to escape them with \.

Also, this will capture port numbers as well. If you don't want that, add : to the negated character class.

Edit: The PCRE version should look like this:

^https?:\/\/[^?\/#\s\r]+

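I haven't used PCRE lately, so you may want to check this with someone. I'm not sure which characters need to be escaped inside a character class in PCRE.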
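As a rough illustration of this answer's idea, here is a minimal Python sketch. It drops the leading ^ and uses a search instead, on the assumption that the URL sits somewhere in the middle of a log line, and it drops \r because \s already covers it. The sample line and the helper name first_url_prefix are made up for the example.

```python
import re

# Scheme, then everything up to the first '?', '/', '#' or whitespace.
HOST_RE = re.compile(r"https?://[^?/#\s]+")
# Variant that also stops at ':' so a port number is left out.
HOST_NO_PORT_RE = re.compile(r"https?://[^?/#\s:]+")


def first_url_prefix(line, pattern=HOST_RE):
    """Return the first scheme+host found in the line, or None."""
    m = pattern.search(line)
    return m.group(0) if m else None


# Hypothetical log line; the real log format is not shown in the question.
sample = "10.0.0.1 TCP 512 http://www.google.com.MaliciousWebsite.com:8080/bla?page=2"
print(first_url_prefix(sample))                   # keeps the port
print(first_url_prefix(sample, HOST_NO_PORT_RE))  # stops before the port
```

Because nothing here depends on a list of TLDs, google.com.MaliciousWebsite.com comes back whole rather than being cut at the first .com.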

Answer 2

You can try this expression:

\b((?:http://)(?:.)*(?:\.)(?:com|info|biz|tv|net))

You can take a look at the description here :)

r"""
\b               # Assert position at a word boundary
(                # Match the regular expression below and capture its match into backreference number 1
   (?:              # Match the regular expression below
      http://          # Match the characters “http://” literally
   )
   (?:              # Match the regular expression below
      .                # Match any single character that is not a line break character
   )*               # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:              # Match the regular expression below
      \.               # Match the character “.” literally
   )
   (?:              # Match the regular expression below
                       # Match either the regular expression below (attempting the next alternative only if this one fails)
         com              # Match the characters “com” literally
      |                # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         info             # Match the characters “info” literally
      |                # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
         biz              # Match the characters “biz” literally
      |                # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
         tv               # Match the characters “tv” literally
      |                # Or match regular expression number 5 below (the entire group fails if this one fails to match)
         net              # Match the characters “net” literally
   )
)
"""
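One thing worth checking with this pattern is the greedy `.*`: it may run past the host and into the path if a listed ending shows up later on the line. The sketch below runs the expression unchanged in Python against a made-up log line (the line itself is an assumption, not taken from the question).

```python
import re

# The expression from this answer, unchanged.
GREEDY_RE = re.compile(r"\b((?:http://)(?:.)*(?:\.)(?:com|info|biz|tv|net))")

# Hypothetical log line whose path also happens to end in ".net".
line = "10.0.0.1 GET http://www.google.com/files/report.net 200"

match = GREEDY_RE.search(line)
greedy_result = match.group(1)
# The greedy ".*" backtracks from the end of the line, so the match
# ends at the LAST listed ending on the line, not the first one.
print(greedy_result)
```

If only the host is wanted, restricting what may appear between the scheme and the ending (for example, no slashes or spaces) avoids this.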

Answer 3

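This will capture http or https followed by :// and a domain name that contains no spaces or slashes. Note that various programming languages have their own regex quirks. You may need to escape / as \/, and in Java every \ needs to be doubled to \\.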

https?://[^ /]+\.(?:com|info|biz|tv|net)
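A minimal Python sketch of how this pattern could be applied line by line. The log lines and the helper name extract_sites are invented for the example; in a Python raw string neither the slashes nor the backslashes need extra escaping.

```python
import re

# The pattern from this answer, unchanged.
URL_RE = re.compile(r"https?://[^ /]+\.(?:com|info|biz|tv|net)")


def extract_sites(lines):
    """Collect every scheme+domain match from an iterable of log lines."""
    sites = []
    for line in lines:
        # The only group is non-capturing, so findall returns whole matches.
        sites.extend(URL_RE.findall(line))
    return sites


# Hypothetical log lines; the real format is not shown in the question.
log_lines = [
    "10.0.0.1 TCP 512 http://www.google.com/bla/page2=blablabla",
    "10.0.0.2 TCP 204 https://www.google.com.MaliciousWebsite.com/login",
    "10.0.0.3 TCP 99 http://example.org/index",
]

for site in extract_sites(log_lines):
    print(site)
```

Since `[^ /]+` is greedy but cannot cross a slash, the match extends to the last listed ending inside the host, which is what keeps google.com.MaliciousWebsite.com in one piece.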


Answer 4

^http\:\/\/(.+)\.(com|info|biz|tv|net)

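This will capture all http domains that end with one of the specified TLDs, but it will also catch domains such as http://test.commercial.ly. I did not add a trailing slash because I wasn't sure whether you will always have a trailing slash on the domain; if you always do, you can simply add a / to the end of the regex. If you don't always have a trailing slash, the pattern may give you some false positives. You can also add https support if needed. Are you sure you want to specify the TLDs? Or do you want to catch any TLD?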
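The false positive mentioned here can be reproduced with a short Python sketch. STRICT_RE is a hypothetical variant, not part of the answer: it requires either a slash or the end of the string after the TLD, which is one way to cover both the "always a trailing slash" and "no trailing slash" cases. The escaped \: and \/ are written plainly because Python does not need them.

```python
import re

# The pattern from this answer.
ORIGINAL_RE = re.compile(r"^http://(.+)\.(com|info|biz|tv|net)")
# Hypothetical stricter variant: the TLD must be followed by '/' or end of input.
STRICT_RE = re.compile(r"^http://(.+)\.(com|info|biz|tv|net)(?:/|$)")


def matched_text(pattern, url):
    """Return the text the pattern matched at the start of url, or None."""
    m = pattern.match(url)
    return m.group(0) if m else None


# ".com" is found inside ".commercial", so the original pattern matches.
print(matched_text(ORIGINAL_RE, "http://test.commercial.ly"))
# The stricter variant rejects it but still accepts real .com hosts.
print(matched_text(STRICT_RE, "http://test.commercial.ly"))
print(matched_text(STRICT_RE, "http://www.google.com/bla"))
```

Note that `(.+)` is still greedy, so a path that itself contains something like ".com/" could stretch the match; excluding slashes from that group would be a further refinement.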

Answer 5

\A[http://]\Z[\.][.*][com,info,biz,tv,net]?![\.]

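I'm not sure what type of regex you are using, but it looks like you are trying to find addresses that contain ".com, net, etc." AND "/", or, perhaps more specifically: addresses that end in .com where the .com does not come before another '.'.

So .com.com is not valid, but .com/ or .com is.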
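One possible reading of this idea, sketched in Python, is a negative lookahead: accept a listed TLD only when it is not immediately followed by another dot or word character. This is an interpretation, not the expression above; the pattern, the name TLD_END_RE and the sample URLs are all assumptions made for illustration.

```python
import re

# Lazy host part, then a listed TLD that is NOT followed by '.' or a word
# character, so ".com." and ".commercial" do not end the match.
TLD_END_RE = re.compile(r"https?://[^\s/]+?\.(?:com|info|biz|tv|net)(?![.\w])")


def site_or_none(text):
    """Return the first matching scheme+domain in text, or None."""
    m = TLD_END_RE.search(text)
    return m.group(0) if m else None


print(site_or_none("http://www.google.com/bla"))
print(site_or_none("http://www.google.com.MaliciousWebsite.com/x"))
print(site_or_none("http://test.commercial.ly"))
```

With this reading, a .com in the middle of a host name does not stop the match, so the full domain is returned instead of being cut off early.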

Answer 6

Hmm, hello user662772:

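Well, I don't mean to be snide, but have you considered using awk? It will split your log file into fields, and then you just print the fields you are after. Bonus: awk does regex pattern matching and substitution.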
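The same field-splitting idea can be sketched in Python rather than awk, to stay in one language with the other examples on this page. It assumes the log fields are separated by whitespace, which the question does not confirm; the helper name url_fields and the sample line are made up.

```python
def url_fields(line):
    """Split a log line on whitespace and keep the fields that look like URLs."""
    return [
        field
        for field in line.split()
        if field.startswith(("http://", "https://"))
    ]


# Hypothetical whitespace-separated log line.
sample_line = "10.0.0.1 TCP 512 http://www.google.com/bla/page2=blablabla"
print(url_fields(sample_line))
```

This returns the full URL field, so a regex (or a URL parser) would still be needed afterwards to trim it down to the domain.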

But you asked about regex:

I'm using Perl-style regular expressions:

http.*(\.com|\.org|\.net)

Whoops, I had to double-escape the backslashes.

Regex - extracting website addresses from a log file

6 answers: