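I've been trying to extract some data from a webpage that uses some unusual methods to detect bots, and I need to get past them.
I first had to get past the annoying CAPTCHA, but now another problem has come up.
The webpage uses (what appears to be) a random link generator to serve the data I want. In the browser only one button is visible, but in the page source I see several randomly generated buttons in the same area, like this: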
...
<div id='BA405352A9' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=A230134'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div><div id='075A1762B3' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=7294A7B'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div><div id='453A2A0469' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=9646432'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div><div id='302B375583' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=933A29B'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div><div id='08171153B4' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=3447182'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div><div id='20813B7B10' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=6B96496'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div><div id='6661917AB6' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=9AA8604'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div><div id='1646980B02' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=5841731'; return false;">
<div class="button-container"><div class="button-position"><div class="btl"><div class="btr"><div class="btc"></div></div></div>
<div class="bml"><div class="bmr"><div class="bmc"></div></div></div><div class="bbl"><div class="bbr"><div class="bbc"></div></div></div>
</div><div class="button-contents">Enter</div></div></button></div></div><script language="javascript">
...
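Judging from the source, it seems that the response to the initial HTTP GET request contains only invisible buttons, and that the "correct" button is somehow made visible after the CSS loads?
I have no experience with this kind of design (or with building websites in general). How does it work, and how can I imitate the browser's behavior to get around it?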
Answer 0 (score: 0)
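I was finally able to access the data. It turns out the CSS styling is set by some JavaScript when the page loads. After looking through the scripts, I found that a lot of the generated data (probably generated server-side) had to be extracted first.
After hours of searching, I finally found the functions the JavaScript uses to transform the data. There are a bunch of them, and the server randomizes the order in which they are applied, to further frustrate attempts to crack the algorithm: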
function showbt(sid) {
    return (dM(aM(bM(fM(gM(cM(sid)))))))
}
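Here the order is randomly generated, and two of the functions are injected into the page source, so they have to be replaced on every request.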
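As a rough illustration of what "applying the functions in a given order" means on the Python side, here is a minimal sketch. The two transforms are hypothetical stand-ins (the real `cM`, `gM`, and so on come from the page and are not shown in this answer); only the chaining logic is the point.

```python
from functools import reduce

def apply_in_order(value, functions):
    """Feed value through each function in turn, first function applied first."""
    return reduce(lambda acc, fn: fn(acc), functions, value)

# Hypothetical stand-ins for two of the page's transforms (assumptions, not the real ones).
def cM(s):
    return s[::-1]

def gM(s):
    return s.upper()

# showbt(sid) = gM(cM(sid)) corresponds to the order [cM, gM]: innermost call first.
result = apply_in_order("abc", [cM, gM])
print(result)
```

Keeping the order as a plain list makes it easy to rebuild whenever the server shuffles it.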
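I was able to translate the JavaScript into Python in full, and I used re and requests to extract and update the functions and the order in which they are applied. I then used the resulting Python code to finally crack the encryption.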
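For the "order in which they are applied" part, something along these lines could work with `re`. This is a sketch that assumes the wrapper always looks like the `showbt` snippet above; in practice `page_source` would be the text of the downloaded page rather than a hard-coded string.

```python
import re

# Sample wrapper copied from the snippet above; normally this is the fetched page text.
page_source = """
function showbt(sid) {
return (dM(aM(bM(fM(gM(cM(sid)))))))
}
"""

def extract_call_order(source):
    """Return the function names in application order (innermost call first)."""
    match = re.search(
        r"function\s+showbt\s*\(\s*sid\s*\)\s*\{\s*return\s*(.*?)\}",
        source,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("showbt wrapper not found in page source")
    # Names appear outermost-first in the text, so reverse them.
    names = re.findall(r"([A-Za-z_$][\w$]*)\s*\(", match.group(1))
    return names[::-1]

print(extract_call_order(page_source))
```

The names can then be looked up in a dict that maps each one to its Python translation.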
(Translation example:)
var _0x7052 = ["", "\x6C\x65\x6E\x67\x74\x68", "\x73\x75\x62\x73\x74\x72", "\x69\x6E\x64\x65\x78\x4F\x66"];
function aarf(_0xb5a3x2) {
    var _0xb5a3x3 = 0;
    var _0xb5a3x4 = 0;
    var _0xb5a3x5 = _0x7052[0];
    for (i = 0; i < _0xb5a3x2[_0x7052[1]]; i += 1) {
        _0xb5a3x3 = stream[_0x7052[3]](_0xb5a3x2[_0x7052[2]](i, 1));
        _0xb5a3x3 = _0xb5a3x3 * _0xb5a3x3 + 6 * _0xb5a3x3 + 6246;
        _0xb5a3x3 = _0xb5a3x3 % stream[_0x7052[1]];
        _0xb5a3x5 += stream[_0x7052[2]](_0xb5a3x3, 1);
    };
    return _0xb5a3x5;
};
After converting the hex escape sequences (used here to obfuscate the code) into plain text:
var _0x7052 = ["", "length", "substr", "indexOf"];
function aarf(_0xb5a3x2) {
    var _0xb5a3x3 = 0;
    var _0xb5a3x4 = 0;
    var _0xb5a3x5 = _0x7052[0];
    for (i = 0; i < _0xb5a3x2[_0x7052[1]]; i += 1) {
        _0xb5a3x3 = stream[_0x7052[3]](_0xb5a3x2[_0x7052[2]](i, 1));
        _0xb5a3x3 = _0xb5a3x3 * _0xb5a3x3 + 6 * _0xb5a3x3 + 6246;
        _0xb5a3x3 = _0xb5a3x3 % stream[_0x7052[1]];
        _0xb5a3x5 += stream[_0x7052[2]](_0xb5a3x3, 1);
    };
    return _0xb5a3x5;
};
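The escape-decoding step can be done mechanically. A minimal sketch with `re.sub`, assuming the obfuscation only uses two-digit `\xNN` escapes as in the array above:

```python
import re

def decode_hex_escapes(js_source):
    """Replace every \\xNN escape with the character it encodes."""
    return re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        js_source,
    )

# The obfuscated array from above, kept as a raw string so Python leaves the escapes alone.
obfuscated = r'var _0x7052 = ["", "\x6C\x65\x6E\x67\x74\x68", "\x73\x75\x62\x73\x74\x72", "\x69\x6E\x64\x65\x78\x4F\x66"];'
print(decode_hex_escapes(obfuscated))
# var _0x7052 = ["", "length", "substr", "indexOf"];
```

One caveat: if an escape decodes to a quote or a backslash, the result is no longer a valid JavaScript string literal, which is fine for reading the code but not for re-running it.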
Finally, after substituting the array lookups in the JS and rewriting it in Python, we get:
# `stream` is the same global string the JavaScript relies on; define it before calling aarf.
def aarf(_0xb5a3x2):
    _0xb5a3x3 = 0
    _0xb5a3x5 = ""
    for i in range(len(_0xb5a3x2)):
        # Position of the i-th character within stream
        _0xb5a3x3 = stream.index(_0xb5a3x2[i:i + 1])
        _0xb5a3x3 = _0xb5a3x3 * _0xb5a3x3 + 6 * _0xb5a3x3 + 6246  # REPNUM2
        _0xb5a3x3 = _0xb5a3x3 % len(stream)
        # The result is already an int, so no rounding is needed before slicing
        _0xb5a3x5 += stream[_0xb5a3x3:_0xb5a3x3 + 1]
    return _0xb5a3x5
# Note: the REPNUM2 comment indicates there are 2 randomly generated numbers in that line;
# they have to be extracted from the webpage and injected into this code.
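One way to handle the two per-page numbers is to pull them out of the JavaScript with a regex and pass them in as parameters instead of patching the Python source. The sketch below assumes the line always has the shape `x = x * x + B * x + C;` seen above; `transform` is a hypothetical parameterized variant of `aarf`, and the `stream` value in the demo is made up.

```python
import re

def extract_constants(js_source):
    """Find the two numbers in a line shaped like `x = x * x + B * x + C;`."""
    pattern = r"(\w+)\s*=\s*\1\s*\*\s*\1\s*\+\s*(\d+)\s*\*\s*\1\s*\+\s*(\d+)\s*;"
    match = re.search(pattern, js_source)
    if match is None:
        raise ValueError("quadratic line not found")
    return int(match.group(2)), int(match.group(3))

def transform(text, stream, b, c):
    """Same mapping as aarf above, with the two constants passed in."""
    out = []
    for ch in text:
        idx = stream.index(ch)  # raises ValueError where JS indexOf would return -1
        out.append(stream[(idx * idx + b * idx + c) % len(stream)])
    return "".join(out)

js_line = "_0xb5a3x3 = _0xb5a3x3 * _0xb5a3x3 + 6 * _0xb5a3x3 + 6246;"
b, c = extract_constants(js_line)
demo_stream = "0123456789AB"  # made-up alphabet, for illustration only
print(b, c, transform("A0", demo_stream, b, c))
```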
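But that's not all...
The buttons themselves are generated by the page, and their IDs are encrypted too, so I had to go through the same steps as above to decrypt the button IDs.
All that remained was to match the decrypted button IDs against the output of the decrypted JavaScript code and find the right button to use!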
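A sketch of that matching step, based on the button markup from the question. The regex is tailored to that exact markup, and both `decrypt_id` and `target` are placeholders: the real decryption function and the real expected value come from the translated page scripts, which are not reproduced here.

```python
import re

# Captures each hidden wrapper's id and the link inside its button's onclick handler.
BUTTON_RE = re.compile(
    r"""<div id='([0-9A-F]+)' style='display:none;'><button[^>]*?onclick="window\.location\.href = '([^']+)'"""
)

def extract_candidates(html):
    """Map each hidden div id to the link its button would open."""
    return dict(BUTTON_RE.findall(html))

def pick_link(candidates, target, decrypt_id):
    """Return the link whose decrypted div id equals target, or None."""
    for div_id, href in candidates.items():
        if decrypt_id(div_id) == target:
            return href
    return None

sample_html = (
    """<div id='BA405352A9' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=A230134'; return false;">"""
    """</button></div><div id='075A1762B3' style='display:none;'><button type="button" value="Upgrade level" class="build" onclick="window.location.href = 'dorf2.php?a=20&c=7294A7B'; return false;">"""
)
candidates = extract_candidates(sample_html)
# str.lower is only a stand-in for the real decryption function.
print(pick_link(candidates, "075a1762b3", str.lower))
```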
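For anyone trying to do something similar: remember that the JavaScript used to work out the right button is always included in the page somehow (otherwise your browser couldn't find the right one either!), so all you have to do is carefully analyze the page and how it works, then reverse-engineer its behavior to undo the encryption.
I did this with no prior JavaScript or HTML experience, so if I could do it, you can too!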
Another workaround is to use Selenium, but that has none of the power and speed of good old requests!