Question

I'm trying to write a very simple crawler in JavaScript (testing in Firefox).

I use the ES6 fetch function to retrieve the documents like this:

fetch(url)
  .then(response => response.text())
  .then(text => (new DOMParser()).parseFromString (text, 'text/html'))
  .then(doc => {
     doc.querySelectorAll('a').forEach(node => {
       fetch(node.href)
         .then(response => response.text())
         .then(text => (new DOMParser()).parseFromString(text, 'text/html'))
         .then(doc => {
           doc.querySelectorAll('a').forEach(node => {
             console.log (node.href);
           });
         });
     });
  });

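The problem is the following, quoted from MDN:

When a DOMParser is instantiated by calling new DOMParser(), it inherits the calling code's principal (except that for chrome callers the principal is set to the null principal) and the documentURI and baseURI of the window the constructor came from.

The first fetch works fine as long as the URL is the same as the window's URL. But with querySelectorAll I collect the anchors from the fetched page so that I can fetch those pages too and build a DOM tree for each URL. The problem is that the DOM tree created by parseFromString has the wrong documentURI. parseFromString does not accept a URL parameter; the document inherits its documentURI from the window instead. That is obviously the wrong URL for a page fetched from somewhere else, which means every relative link in the fetched document is broken.

How can I parse a document from a string and set the correct documentURI?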

(new DOMParser()).parseFromString('<html></html>', 'text/html')

The properties URL and documentURI are both read-only.

Answer 1

You could try something like this. Just keep track of the correct origin manually.

// Save the origin of the original request.
var origin1 = new URL(url).origin;

fetch(url)
  .then(response => response.text())
  .then(text => (new DOMParser()).parseFromString(text, 'text/html'))
  .then(doc => {
     doc.querySelectorAll('a[href]').forEach(node => {
       // Use getAttribute('href') instead of node.href: node.href is always
       // absolute and has already been resolved against the wrong base.
       var href = node.getAttribute('href');
       if (!/^https?:\/\//.test(href)) {
         // This is a relative URL, so prepend the saved origin.
         // Note: this only works for root-relative paths such as "/page".
         href = origin1 + href;
       }

       fetch(href)
         .then(response => response.text())
         .then(text => (new DOMParser()).parseFromString(text, 'text/html'))
         .then(doc => {
           doc.querySelectorAll('a[href]').forEach(node => {
             // See above: check whether the href is relative and, if so,
             // prepend the correct origin.
             // console.log(node.href);
           });
         });
     });
  });
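
Prepending the origin only handles root-relative paths such as "/page"; links like "page2.html" or "../up" need full URL resolution. As a rough sketch of one way to do that, the standard URL constructor can resolve each raw href against the address the page was fetched from. The helper name resolveLinks and the example.com URLs are hypothetical and not part of the original answer.

```javascript
// Hypothetical helper (an illustration, not the original answer's code):
// resolve raw href attribute values against the URL of the page they came from.
function resolveLinks(hrefs, pageUrl) {
  const resolved = [];
  for (const href of hrefs) {
    if (href === null || href === undefined) continue; // <a> without an href
    try {
      // new URL(relative, base) applies the standard URL resolution rules;
      // an href that is already absolute ignores the base.
      const target = new URL(href, pageUrl);
      // Assumption: a crawler only wants http(s) links (skips mailto:, javascript:, ...).
      if (target.protocol === 'http:' || target.protocol === 'https:') {
        resolved.push(target.href);
      }
    } catch (err) {
      // The href could not be parsed as a URL; skip it.
    }
  }
  return resolved;
}

console.log(resolveLinks(['/a', 'b.html', '../c'], 'https://example.com/dir/page.html'));
```

In the crawler, the list of hrefs could come from Array.from(doc.querySelectorAll('a'), a => a.getAttribute('href')), and response.url (the final URL after redirects) is probably a better base than the URL originally requested.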

Answer 2

If I interpret the question correctly, the document URL would be the href of the <a> element that was used to fetch the HTML.
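
Since documentURI itself cannot be changed, one possible workaround is to give the parsed document a <base> element that points at that href, so that relative links resolve against it. The following is only a sketch under that assumption: the helper name withBaseUrl is hypothetical, and it changes the document's base URL (baseURI), not documentURI.

```javascript
// Hypothetical helper: make a parsed document resolve relative URLs against
// the page it was fetched from by adding a <base> element to its <head>.
function withBaseUrl(doc, pageUrl) {
  // Assumption: if the page already declares its own <base href>, leave it alone.
  if (!doc.querySelector('base[href]')) {
    const base = doc.createElement('base');
    base.setAttribute('href', pageUrl);
    // Documents parsed as 'text/html' always get a <head> element.
    doc.head.prepend(base);
  }
  return doc;
}

// Browser usage (assumes `text` is the fetched HTML and `pageUrl` its URL):
//   const doc = withBaseUrl(new DOMParser().parseFromString(text, 'text/html'), pageUrl);
//   doc.querySelectorAll('a').forEach(a => console.log(a.href));
```

With the <base> element in place, node.href on anchors in the parsed document should resolve against pageUrl rather than the window's URL, so no manual string handling is needed.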
