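Regex to match alphanumeric text in a string that contains a URL

Date: 2019-02-02 17:35:31

Tags: javascript regex google-apps-script

Given a few scenarios, how can I match and extract the alphanumeric characters (and symbols) from a string that contains a URL? I'm currently using Google Apps Script to retrieve the plain text of hyperlinked text from Gmail thread messages, and I basically want to match and extract the title from strings like this: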

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

..., from which I only want to output: "Testing: Stack Overflow Title 123?"

Here is another scenario:

var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

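...where, again, I only want to output: "Testing: Stack Overflow Title 123?"

I tried the following as an initial test: first check whether the string contains a URL (in this example I confirmed that the URL regex works and outputs https://www.stackoverflow.com), then test whether a title is present so that it can finally be extracted, but to no avail: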

var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
var urlRegex = /(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/;
var titleRegex = /^[a-zA-Z0-9_:?']*$/;
var containsUrl = urlRegex.test(scenario1);
if (containsUrl) {
    var containsTitle = titleRegex.test(scenario1);
    if (containsTitle) { // No match, and doesn't run
      var title = titleRegex.exec(scenario1)[0];
      Logger.log("title: " + title);
    }
}

Basically, I would like a regex pattern that matches everything except the URL.

3 Answers:

Answer 0 (score: 2)

With this regex we can capture any run of text other than the URLs:

(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)

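Explanation:

  • (?:^|\s+) - matches the start of the string or one or more whitespace characters
  • ((?:(?!:\/\/).)*) - captures any text that does not contain ://, the literal sequence that identifies a URL
  • (?=\s|$) - positive lookahead ensuring that whitespace or the end of the string follows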


This matches and captures every run of text that is not a URL. Hope this works for you.

Here is a JavaScript demo.

var arr = ['Testing1: Stack Overflow Title 123? https://www.stackoverflow.com','https://www.stackoverflow.com    Testing2: Stack Overflow Title xyz? https://www.stackoverflow.com Hello this is simple text ftp://www.downloads.com/']

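for (const s of arr) {
	// A fresh /g regex per string, so lastIndex starts at 0
	const reg = /(?:^|\s+)((?:(?!:\/\/).)*)(?=\s|$)/g;
	let match = reg.exec(s);
	while (match != null) {
		// Group 1 holds the non-URL text
		console.log(match[1]);
		match = reg.exec(s);
	}
}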

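Also, since you appear to want to restrict the characters allowed in the matched title, you can use the character set [a-zA-Z0-9_:?' ] (the space in the set lets it match spaces as well) in place of the . in the regex above, which more precisely avoids capturing titles that contain unexpected characters.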

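A minimal sketch of that substitution, assuming the character set simply takes the place of the dot in the first pattern. The names titleRegex and extractTitles are illustrative, and the sketch relies on String.prototype.matchAll, so it assumes an ES2020-capable runtime. Empty captures are filtered out as a precaution.

```javascript
// Assumption: the title character set replaces the dot in the original pattern.
const titleRegex = /(?:^|\s+)((?:(?!:\/\/)[a-zA-Z0-9_:?' ])*)(?=\s|$)/g;

// Hypothetical helper: collect every non-empty capture from the input string.
function extractTitles(s) {
  return Array.from(s.matchAll(titleRegex), m => m[1].trim()).filter(Boolean);
}

const scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";
const scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

console.log(extractTitles(scenario1));
console.log(extractTitles(scenario2));
```

Because the dot in a URL is not part of the character set, the host portion can never be captured as a title, even when it is preceded by whitespace.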

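Answer 1 (score: 1)

One possibility is to match up to the first URL you encounter, using either a positive lookahead or a capturing group.

With a positive lookahead, that looks like this: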

\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)

const regexLookahead = /\bTesting: .*?(?=\s*(?:https?|ftps?):\/\/)/;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexLookahead)[0]));

With a capturing group, your value is in the first capture group:

(\bTesting: .*?)\s*(?:https?|ftps?):\/\/

const regexGroup = /(\bTesting: .*?)\s*(?:https?|ftps?):\/\//;
[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com"
].forEach(s => console.log(s.match(regexGroup)[1]));
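Both snippets index directly into the result of match(), which is null when the input has no "Testing: " prefix followed by a URL. A small guarded variant, as a sketch only (extractTitle is a hypothetical helper name, not part of the answer):

```javascript
// Same capturing-group pattern as in the answer.
const regexGroup = /(\bTesting: .*?)\s*(?:https?|ftps?):\/\//;

// Hypothetical helper: returns the captured title, or null when nothing matches.
function extractTitle(s) {
  const m = s.match(regexGroup);
  return m ? m[1] : null;
}

console.log(extractTitle("Testing: Stack Overflow Title 123? https://www.stackoverflow.com"));
console.log(extractTitle("No link here"));
```

Returning null leaves it to the caller to decide how to handle messages that carry no title.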

If you want to keep everything except the URLs, you can match them and replace them with an empty string:

\s*(?:https?|ftps?):\/\/\S+

[
  "Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com",
  "https://www.stackoverflow.com test https://www.stackoverflow.com test https://www.stackoverflow.com test",
  "https://www.stackoverflow.com test",
  "test https://www.stackoverflow.com"
].forEach(s => console.log(s.replace(/\s*(?:https?|ftps?):\/\/\S+/g, '').trim()));

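Answer 2 (score: 0)

You can .split() the string on the space character and .filter() the resulting array to exclude elements that start with a protocol followed by ://, or that end with a word, a dot character, and another word.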

const splitURL = s => s.split` `.filter(w => !/^\w+(?=:\/\/)|\w+\.\w+$/.test(w)).join` `;
 
var scenario1 = "Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

var scenario2 = "https://www.stackoverflow.com Testing: Stack Overflow Title 123? https://www.stackoverflow.com";

console.log(splitURL(scenario1), splitURL(scenario2));
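One thing to keep in mind with this approach: the second alternative, \w+\.\w+$, also drops ordinary tokens that merely end in word-dot-word, such as file names. The sketch below applies the same test token by token to a made-up sentence. The name looksLikeUrl and the sample input are illustrative only.

```javascript
// Same test as in the answer: a protocol prefix, or a trailing word.word.
const looksLikeUrl = w => /^\w+(?=:\/\/)|\w+\.\w+$/.test(w);

// Hypothetical input containing both a URL and a file name.
const tokens = "Read notes.txt at https://www.stackoverflow.com today".split(" ");
const kept = tokens.filter(w => !looksLikeUrl(w));

console.log(kept.join(" "));
```

If titles may legitimately contain such tokens, testing only for the protocol prefix would be the more conservative filter.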