Question

I have a string:

<h1>hello</h1>
<script src="http://www.test.com/file1.js"></script>
<script src="http://www.test.com/file2.js"></script>
<p>bye</p>

I need to build an array from the URLs found in the string:

['http://www.test.com/file1.js', 'http://www.test.com/file2.js']

I also need to replace each whole line (including the script tags) with nothing.

So far, this is what I have for finding the URLs:

^(<script src=")(.*)("><\/script>)$

The problem is that it only works for

<script src="http://www.test.com/file1.js"></script>

If I define my script like this

<script id="something" src="http://www.test.com/file1.js"></script>

it does not work.

Answer 1

Consider using a proper HTML parser such as cheerio: find the <script> tags, push each src into an array, and then remove the tag:

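const cheerio = require('cheerio');

const htmlStr = `<h1>hello</h1>
<script src="http://www.test.com/file1.js"></script>
<script src="http://www.test.com/file2.js"></script>
<p>bye</p>`;
const $ = cheerio.load(htmlStr);

const urls = [];
$('script').each((_, script) => {
  // cheerio elements are not browser DOM nodes, so read the attribute via .attr()
  urls.push($(script).attr('src'));
  $(script).remove();
});
// cheerio.load() wraps the fragment in <html> and <body>, so read the body's inner HTML
const result = $('body').html();
console.log(urls);
console.log(result);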

Answer 2

To get just the URLs, you can use:

^<script.*?src="(.*?)".*?><\/script>$

This handles attributes both before and after the src attribute.
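
A minimal sketch of how a pattern like this could be used to build the array and blank out the lines, assuming Node.js and one script tag per line. The g and m flags, the lazy (.*?) group, and the names html, scriptLine, urls, and cleaned are assumptions made for illustration; they are not part of the answer above.

```javascript
// Sample input; the extra id and defer attributes are added here for illustration.
const html = `<h1>hello</h1>
<script id="something" src="http://www.test.com/file1.js"></script>
<script src="http://www.test.com/file2.js" defer></script>
<p>bye</p>`;

// m: ^ and $ match at every line boundary; g: find every matching line.
// The lazy (.*?) stops at the first closing quote after src=".
const scriptLine = /^<script.*?src="(.*?)".*?><\/script>$/gm;

// Collect capture group 1 (the src value) from every match.
const urls = Array.from(html.matchAll(scriptLine), (m) => m[1]);

// Replace each matched line with an empty string; the line breaks themselves stay.
const cleaned = html.replace(scriptLine, '');

console.log(urls);
console.log(cleaned);
```

Without the m flag, ^ and $ only match at the start and end of the whole string, so nothing in a multi-line input would match.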

Answer 3

This regex may help you get those URLs:

^<.+="(.+)"><\/.+>$

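It creates a single capture group that holds your target URL and filters out everything else. It also works with <a> tags and other similar tags that follow the same opening and closing pattern.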
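
A small check of what this pattern captures on a few sample lines, as a sketch assuming Node.js. The sample lines and the names generic, samples, and captured are made up for illustration. Because .+ is greedy, the group ends up holding the value of the last attribute before the closing >, so the URL is returned only when src or href comes last.

```javascript
const generic = /^<.+="(.+)"><\/.+>$/;

// Hypothetical sample lines, each tested on its own.
const samples = [
  '<script src="http://www.test.com/file1.js"></script>',
  '<script id="something" src="http://www.test.com/file1.js"></script>',
  '<a href="http://www.test.com/page"></a>',
  // src is not the last attribute here, so the group captures the id value instead
  '<script src="http://www.test.com/file1.js" id="something"></script>',
];

// Capture group 1 for each line, or null when the line does not match.
const captured = samples.map((line) => {
  const m = line.match(generic);
  return m ? m[1] : null;
});

console.log(captured);
```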

Answer 4

Use this:

^(<script )(.*)(src=")(.*)("><\/script>)$

Group 4 is the URL.

Or use ^(?:<script )(?:.*)(?:src=")(.*)(?:"><\/script>)$ so that the other groups are non-capturing.
