Retrieving relative URLs from text

Date: 2019-12-16 00:46:01

Tags: node.js regex geturl

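I have an HTML string that contains both absolute and relative URLs, and I am trying to retrieve only the relative ones. I tried the get-urls package, but it only retrieves absolute URLs.

An example of the HTML string I receive: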

<!DOCTYPE>
<html>
<head>

<title>Our first HTML page</title>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

</head>
<body>

<h2>Welcome to the web site: this is a heading inside of the heading tags.</h2>

<p>This is a paragraph of text inside the paragraph HTML tags. We can just keep writing ...
</p>

<h3>Now we have an image:</h3>

<div><img src="/images/plantTracing.gif" alt="Graphic of a Mouse Pad"></div>

<h3>
This is another heading inside of another set of headings tags; this time the tag is an 'h3' instead of an 'h2' , that means it is a less important heading.
</h3>

<h4>Yet another heading - right after this we have an HTML list:</h4>

<ol>
<li><a href="https://github.com/">First item in the list</a></li>
<li><a href="/modules/example.md"> Second item in the list</a></li>
<li>Third item in the list</li>
</ol>

<p>You will notice in the above HTML list, the HTML automatically creates the numbers in the list.</p>

<h3>About the list tags</h3>
</body>
</html>

Currently I am doing this:

getUrls(html) // html: the HTML string received

It only returns {https://github.com/}

I want it to return {https://github.com/, /modules/example.md}

1 answer:

Answer 0: (score: 0)

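The get-urls package requires a URL to begin either with a scheme such as http:// or with a host name that has a known top-level domain.

In fact, the documentation even includes this: "Require URLs to have a scheme or leading www. to be considered an URL."

Since the paths you are looking for have neither, the package cannot do what you want.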
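The distinction can be illustrated without the package. This is a minimal sketch, assuming Node.js's built-in WHATWG URL class as a stand-in for the check. It is not how get-urls works internally, and hasScheme is a hypothetical helper name. A string parses on its own only if it carries a scheme, so the paths from the question fail the check.

```javascript
// Hypothetical helper: true when the string parses as a URL on its own,
// i.e. it carries a scheme. Uses Node's global WHATWG URL class.
function hasScheme(value) {
  try {
    new URL(value); // throws a TypeError when there is no scheme to parse
    return true;
  } catch {
    return false;
  }
}

const candidates = [
  'https://github.com/',
  '/modules/example.md',
  '/images/plantTracing.gif',
];

for (const candidate of candidates) {
  console.log(candidate, hasScheme(candidate) ? 'absolute' : 'relative');
}
```

Only the first candidate is reported as absolute, which matches what the question observed from get-urls.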

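You would probably be better served by an actual HTML parser such as cheerio. With a parser you read URLs out of HTML attributes such as href and src, relying on the HTML context rather than on text-matching tricks, so you can find every path that is a relative URL.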
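To make the attribute-based idea concrete without adding a dependency, here is a rough sketch that does not use cheerio. It pulls href and src values with a regular expression and keeps those that do not parse as absolute URLs. The regex is only a stand-in for a real parser: it misses unquoted attributes and can match inside comments or scripts. The names extractRelativeUrls and html are hypothetical, chosen for illustration. With a parser, the attribute values would come from the parsed document instead, and the same filter would apply.

```javascript
// Hypothetical helper: collect href/src attribute values that are relative URLs.
function extractRelativeUrls(html) {
  // Matches href="..." or src="..." with single or double quotes.
  const attrPattern = /\b(?:href|src)\s*=\s*(["'])(.*?)\1/gi;
  const relative = [];
  for (const match of html.matchAll(attrPattern)) {
    const value = match[2].trim();
    if (value === '') continue;
    try {
      new URL(value); // parses only when a scheme is present: absolute, skip it
    } catch {
      relative.push(value); // no scheme: treat it as a relative URL
    }
  }
  return relative;
}

// Fragment taken from the HTML in the question.
const html = `
<div><img src="/images/plantTracing.gif" alt="Graphic of a Mouse Pad"></div>
<li><a href="https://github.com/">First item in the list</a></li>
<li><a href="/modules/example.md"> Second item in the list</a></li>
`;

console.log(extractRelativeUrls(html));
```

For this fragment the result contains /images/plantTracing.gif and /modules/example.md, while https://github.com/ is filtered out. Protocol-relative values such as //host/path also lack a scheme, so this sketch would keep them.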