Question

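I'm building a Sinatra-based application and am trying to parse a long string with a regular expression to extract a link from it.

Here is an excerpt of the string, containing the relevant information I need to extract: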

time=18ms\n[INFO] Calculating CPD for 0 files\n[INFO] CPD calculation finished\n[INFO] Analysis report generated in 325ms, dir size=14 KB\n[INFO] Analysis reports compressed in 187ms, zip size=8 KB\n[INFO] Analysis report uploaded in 31ms\n[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n[INFO] Note that you will be able to access the updated dashboard once the server has processed the submitted analysis report\n[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n[INFO] -----------------------------------------------------------------------

I need to be able to extract the following:

http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn

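The closest I've got is /(?=http).[a*-z]*/, but that is nowhere near what I need, because it finds 615 matches instead of 1.

A further complication is that AVhFxTkyob-dgWZqnfIn is not static; it changes with every build.

I've been using Rubular.com to work out the right regex.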

Answer 1

>> string = '[your long string here]'
>> regex = /(http:[\w\/.?=-]+)(\\n)/
>> string.scan(regex).first.first
=> "http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn"

Following the example above, I ended up modifying the regex to this:

(http:\/\/sonar[\w\/.?=-]+task[\w\/.?=-]+(?!.\\n))

...and returning the result like this:

string.scan(regex).first.first

I modified the regex because the previous one returned many results when I fed it the full string rather than the excerpt in the OP.
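One detail worth spelling out: the `(\\n)` group in the patterns above matches a literal backslash followed by `n`, which fits a single-quoted string but not a string containing real newlines. Here is a minimal sketch of a pattern that anchors on the task path instead and so works in both cases. The constant name `TASK_URL` is made up for illustration, and the pattern assumes the wanted link always has the path `/api/ce/task` and an id made only of word characters and hyphens, as in the excerpt.

```ruby
# Hypothetical pattern: assumes the path is always /api/ce/task and the id
# contains only word characters and hyphens, as in the excerpt.
TASK_URL = %r{https?://[\w.-]+/api/ce/task\?id=[\w-]+}

# Double quotes: "\n" is a real newline character here.
with_newlines = "[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n[INFO] ---"

# Single quotes: '\n' is a backslash followed by the letter n.
with_literal_backslash_n = '[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n[INFO] ---'

# String#[] with a regex returns the first match, or nil if there is none.
url_a = with_newlines[TASK_URL]
url_b = with_literal_backslash_n[TASK_URL]

puts url_a
puts url_b
```

Because the id character class excludes both whitespace and the backslash, the match stops at the end of the id whichever kind of line ending follows it.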

Answer 2

Well-tested tools make your life easier. I'd suggest using URI's extract method:

require 'uri'

str = "time=18ms\n[INFO] Calculating CPD for 0 files\n[INFO] CPD calculation finished\n[INFO] Analysis report generated in 325ms, dir size=14 KB\n[INFO] Analysis reports compressed in 187ms, zip size=8 KB\n[INFO] Analysis report uploaded in 31ms\n[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n[INFO] Note that you will be able to access the updated dashboard once the server has processed the submitted analysis report\n[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n[INFO] -----------------------------------------------------------------------"

URI.extract(str)
# => ["http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject",
#     "http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn"]

Then it's just a matter of finding the link you want and using it.
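As a rough sketch of that "find the link you want" step, the extracted URLs can be filtered by their path. The shortened log string and the choice to match on `/api/ce/task` are assumptions based on the excerpt in the question, not something the original answer specifies. Also note that recent Ruby versions mark `URI.extract` as obsolete and warn about it in verbose mode, so treat this as illustrative.

```ruby
require 'uri'

# Shortened stand-in for the full log; "\n" is a real newline here.
log = "[INFO] ANALYSIS SUCCESSFUL, you can browse http://sonar.company.com/dashboard/index/com.company.paas.maventestproject:MavenTestProject\n" \
      "[INFO] More about the report processing at http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn\n" \
      "[INFO] ---"

# Restrict extraction to http/https so other scheme-like tokens are ignored.
urls = URI.extract(log, %w[http https])

# Assumption: the wanted link is the one whose path is exactly /api/ce/task.
task_url = urls.find { |u| URI.parse(u).path == '/api/ce/task' }

puts task_url
```

Comparing the parsed path rather than the raw string keeps the check independent of the id, which changes from run to run.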

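It's also worth looking at all the other methods URI brings to the party, because it understands how to split and build URIs according to the RFC.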
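To illustrate the splitting and building mentioned here, a small sketch using the task URL from the question follows. The variable names are arbitrary, and pulling out the `id` parameter is just one assumed use case.

```ruby
require 'uri'

uri = URI.parse('http://sonar.company.com/api/ce/task?id=AVhFxTkyob-dgWZqnfIn')

# Splitting: each component is available through its own accessor.
host  = uri.host   # "sonar.company.com"
path  = uri.path   # "/api/ce/task"
query = uri.query  # "id=AVhFxTkyob-dgWZqnfIn"

# Decode the query string into a hash to get at the changing id.
params  = URI.decode_www_form(query).to_h
task_id = params['id']

# Building: assemble a URL again from its components.
rebuilt = URI::HTTP.build(
  host: host,
  path: path,
  query: URI.encode_www_form(id: task_id)
)

puts task_id
puts rebuilt
```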

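Don't roll your own code or regex to do something others have already done, especially when their code is well tested. You'll avoid the pitfalls others have fallen into. The authors/maintainers of URI manage the built-in pattern so we don't have to. And it's a lot more complicated than you'd think, for instance: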

URI::REGEXP::PATTERN::ABS_URI
"[a-zA-Z][\\-+.a-zA-Z\\d]*:(?:(?://(?:(?:(?:[\\-_.!~*'()a-zA-Z\\d;:&=+$,]|%[a-fA-F\\d]{2})*@)?(?:(?:[a-zA-Z0-9\\-.]|%\\h\\h)+|\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}|\\[(?:(?:[a-fA-F\\d]{1,4}:)*(?:[a-fA-F\\d]{1,4}|\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})|(?:(?:[a-fA-F\\d]{1,4}:)*[a-fA-F\\d]{1,4})?::(?:(?:[a-fA-F\\d]{1,4}:)*(?:[a-fA-F\\d]{1,4}|\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}))?)\\])(?::\\d*)?|(?:[\\-_.!~*'()a-zA-Z\\d$,;:@&=+]|%[a-fA-F\\d]{2})+)(?:/(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*(?:;(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*)*(?:/(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*(?:;(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*)*)*)?|/(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*(?:;(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*)*(?:/(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*(?:;(?:[\\-_.!~*'()a-zA-Z\\d:@&=+$,]|%[a-fA-F\\d]{2})*)*)*)(?:\\?(?:(?:[\\-_.!~*'()a-zA-Z\\d;/?:@&=+$,\\[\\]]|%[a-fA-F\\d]{2})*))?|(?:[\\-_.!~*'()a-zA-Z\\d;?:@&=+$,]|%[a-fA-F\\d]{2})(?:[\\-_.!~*'()a-zA-Z\\d;/?:@&=+$,\\[\\]]|%[a-fA-F\\d]{2})*)"

How to use a regex to get a specific part of a string
