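Scraping: scraper stops working because the page structure changes

Time: 2020-09-28 11:33:42

Tags: html web-scraping web-crawler

When scraping a web page, its structure keeps changing. By that I mean it changes dynamically, which causes the scraper to stop working. Is there a mechanism for detecting structural changes in a page before running the full crawler, so that I can tell whether the structure has changed?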

2 answers:

Answer 0: (score: 0)

If you can run your own JavaScript code in the page, you can use MutationObserver to monitor changes made to the DOM tree.

Something like this:

function waitForDomStability(timeout: number): Promise<void> {
  return new Promise<void>((resolve) => {
    // stop observing and resolve once the DOM has been quiet for `timeout` ms
    const waitResolve = (observer: MutationObserver) => {
      observer.disconnect();
      resolve();
    };

    let timeoutId: number;
    const observer = new MutationObserver((mutationList, obs) => {
      for (let i = 0; i < mutationList.length; i += 1) {
        // only child-list mutations (nodes added or removed) restart the countdown
        if (mutationList[i].type === 'childList') {
          window.clearTimeout(timeoutId);
          timeoutId = window.setTimeout(waitResolve, timeout, obs);
          break;
        }
      }
    });

    // resolve after `timeout` ms if no child-list mutation arrives at all
    timeoutId = window.setTimeout(waitResolve, timeout, observer);

    // start observing document.body
    observer.observe(document.body, { attributes: true, childList: true, subtree: true });
  });
}

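I use this approach in an open-source scraping extension. For the full code, see /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.

Answer 1: (score: 0)

You can certainly use "snapshots" to compare two versions of the same page. I implemented something similar to Java's String hashCode for this purpose.

JavaScript code: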

/*
returns a DOM element snapshot as a hash code of its innerText
starting point is Java's String hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
keep everything fast: only work with a 32-bit hash, avoid exponentiation
iterative form: hash = hash*31 + s[i], truncated to 32 bits at every step
*/
function getSnapshot() {
    const snapshotSelector = 'body';
    const nodeToBeHashed = document.querySelector(snapshotSelector);
    if (!nodeToBeHashed) return 0;

    const { innerText } = nodeToBeHashed;

    let hash = 0;
    if (innerText.length === 0) {
      return hash;
    }

    for (let i = 0; i < innerText.length; i += 1) {
      // an integer between 0 and 65535 representing the UTF-16 code unit
      const charCode = innerText.charCodeAt(i);

      // multiply by 31 and add current charCode
      hash = ((hash << 5) - hash) + charCode;

      // convert to 32 bits as bitwise operators treat their operands as a sequence of 32 bits
      hash |= 0;
    }

    return hash;
}
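
To turn the snapshot into an actual change check, one option is to store the hash from an earlier run and compare it with the current one before starting the full crawl. The sketch below is only an illustration: `hashText` and `hasChanged` are hypothetical helper names, not part of the answer's code. `hashText` applies the same rolling hash as `getSnapshot` to an arbitrary string so it can be tried outside a browser.

```javascript
// Hypothetical helper: the same 32-bit rolling hash as getSnapshot,
// applied to any string instead of document.body.innerText.
function hashText(text) {
  let hash = 0;
  for (let i = 0; i < text.length; i += 1) {
    // hash * 31 + code unit, truncated to a signed 32-bit integer
    hash = ((hash << 5) - hash + text.charCodeAt(i)) | 0;
  }
  return hash;
}

// Hypothetical helper: true when the current text no longer matches
// the hash recorded on a previous run.
function hasChanged(baselineHash, currentText) {
  return hashText(currentText) !== baselineHash;
}

// Assumed workflow: the baseline would normally be persisted between runs.
const baseline = hashText('Price: 10 USD');
console.log(hasChanged(baseline, 'Price: 10 USD')); // false
console.log(hasChanged(baseline, 'Price: 12 USD')); // true
```

Because the hash is computed over innerText, this detects changes in the visible text rather than in the tag structure itself, and distinct texts can in principle collide on the same 32-bit value.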

If you cannot run JavaScript code in the page, you can use the entire HTML response as the content to hash, in whatever language you prefer.
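
As a rough sketch of hashing the whole response outside the browser, the following uses Node.js and its built-in `crypto` module. The choice of Node.js and SHA-256, the helper name `hashHtml`, and the sample HTML strings are all assumptions made for illustration; fetching the page is left out.

```javascript
const crypto = require('crypto');

// Hypothetical helper: SHA-256 fingerprint of a raw HTML response body.
function hashHtml(html) {
  return crypto.createHash('sha256').update(html, 'utf8').digest('hex');
}

// Made-up responses standing in for two fetches of the same URL.
const before = '<html><body><div class="price">10</div></body></html>';
const after = '<html><body><span class="price">10</span></body></html>';

console.log(hashHtml(before) === hashHtml(before)); // true
console.log(hashHtml(before) === hashHtml(after)); // false
```

A hash of the raw response changes on any byte difference, including timestamps or rotating ads, so in practice it may flag changes that do not affect the scraper.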