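When scraping web pages, the page structure keeps changing. By that I mean it changes dynamically, and the scraper stops working as a result. Is there a mechanism to detect structural changes in a page before running the full crawler, so I can tell whether the structure has changed?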
Answer 0: (score: 0)
If you can run your own JavaScript code in the page, you can use MutationObserver to monitor changes made to the DOM tree.
Something like:
function waitForDomStability(timeout: number): Promise<void> {
  return new Promise((resolve) => {
    // stop observing and resolve once the DOM has been quiet for `timeout` ms
    const waitResolve = (observer: MutationObserver) => {
      observer.disconnect();
      resolve();
    };

    let timeoutId: number;
    const observer = new MutationObserver((mutationList, obs) => {
      for (let i = 0; i < mutationList.length; i += 1) {
        // we only care if child nodes have been added or removed
        if (mutationList[i].type === 'childList') {
          // restart the countdown timer
          window.clearTimeout(timeoutId);
          timeoutId = window.setTimeout(waitResolve, timeout, obs);
          break;
        }
      }
    });

    // resolve after `timeout` ms if no childList mutation is ever observed
    timeoutId = window.setTimeout(waitResolve, timeout, observer);

    // start observing document.body
    observer.observe(document.body, { attributes: true, childList: true, subtree: true });
  });
}
I've used this approach in an open source scraping extension. For the full code, see /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.
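Answer 1: (score: 0)
You can certainly use a "snapshot" to compare two versions of the same page. I've implemented something similar to Java's String hashCode for this purpose.
JavaScript code: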
/*
returns a DOM element snapshot as a hash code of its innerText
starting point is Java's String hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
keep everything fast: only work with a 32-bit hash, no explicit exponentiation
the loop applies Horner's rule, hash = hash*31 + s[i], wrapping to 32 bits at each step
*/
function getSnapshot() {
  const snapshotSelector = 'body';
  const nodeToBeHashed = document.querySelector(snapshotSelector);
  if (!nodeToBeHashed) return 0;

  const { innerText } = nodeToBeHashed;
  let hash = 0;
  if (innerText.length === 0) {
    return hash;
  }

  for (let i = 0; i < innerText.length; i += 1) {
    // an integer between 0 and 65535 representing the UTF-16 code unit
    const charCode = innerText.charCodeAt(i);
    // multiply by 31 (hash * 32 - hash) and add the current charCode
    hash = ((hash << 5) - hash) + charCode;
    // truncate to a signed 32-bit integer, as bitwise operators work on 32-bit operands
    hash |= 0;
  }

  return hash;
}
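To actually compare two versions, the hash from a known-good run has to be stored and checked against the current one. Here is a minimal sketch of that step, assuming the page text is already available as a string. The names hashText and snapshotChanged are made up for illustration; hashText repeats the loop from getSnapshot so it can run outside a browser.

```javascript
// Hypothetical string-only variant of the hash loop in getSnapshot.
function hashText(text) {
  let hash = 0;
  for (let i = 0; i < text.length; i += 1) {
    // hash * 31 + current UTF-16 code unit
    hash = ((hash << 5) - hash) + text.charCodeAt(i);
    // truncate to a signed 32-bit integer
    hash |= 0;
  }
  return hash;
}

// Hypothetical check: true when the current text no longer matches the
// hash stored from an earlier, known-good run.
function snapshotChanged(baselineHash, currentText) {
  return hashText(currentText) !== baselineHash;
}

const baseline = hashText('Price: 10 EUR');
console.log(snapshotChanged(baseline, 'Price: 10 EUR')); // false
console.log(snapshotChanged(baseline, 'Price: 12 EUR')); // true
```

Keep in mind that a 32-bit hash can collide, so an unchanged hash is a strong hint rather than proof that nothing changed.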
If you can't run JavaScript code in the page, you can hash the entire HTML response instead, in whatever language you prefer.
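As a rough sketch of the server-side variant, assuming Node.js and that the HTML body has already been fetched as a string: hashHtml hashes the raw response with SHA-256 from the built-in crypto module. Since a hash of the whole response changes on any text edit, hashStructure is a variation (not part of the approach above) that hashes only the sequence of opening tag names. All three function names are made up for illustration, and the regex is a crude stand-in for a real HTML parser.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: SHA-256 hex digest of the raw HTML response body.
function hashHtml(html) {
  return createHash('sha256').update(html, 'utf8').digest('hex');
}

// Hypothetical variation: hash only the sequence of opening tag names, so
// text-only edits are not reported as a structural change.
// Assumption: a regex is good enough here; it will misread "<" inside
// scripts or text, so treat the result as an approximation.
function hashStructure(html) {
  const tagNames = [];
  const openingTag = /<([a-zA-Z][a-zA-Z0-9-]*)/g;
  let match;
  while ((match = openingTag.exec(html)) !== null) {
    tagNames.push(match[1].toLowerCase());
  }
  return hashHtml(tagNames.join('>'));
}

// Compare against the fingerprint stored from the last known-good run.
function hasChanged(previousHash, currentHash) {
  return previousHash !== currentHash;
}

const before = '<html><body><h1>Old title</h1><p>text</p></body></html>';
const after = '<html><body><div><h1>Old title</h1></div></body></html>';
console.log(hasChanged(hashStructure(before), hashStructure(after))); // true
```

Raw HTML often contains values that differ on every request (timestamps, session tokens), so the full-response hash may report a change on every fetch; the tag-only fingerprint is less sensitive to that.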