So, when I run this on my local machine, from PyCharm or from a shell script, the following code works fine:
    # -*- coding: utf-8 -*-
    import requests
    from lxml import etree, html
    import chardet


    def gimme_pairs():
        url = "https://halbidoncom/sha.xml"
        page = requests.get(url).content
        # Re-encode the raw body as UTF-8 if chardet detects something else.
        encoding = chardet.detect(page)['encoding']
        if encoding != 'utf-8':
            page = page.decode(encoding, 'replace').encode('utf-8')
        doc = html.fromstring(page, base_url=url)
        print(doc)
        print(page)
        wanted = doc.xpath('//location')
        print(wanted)
        date_list = None
        tashkif_list = None
        # Pull the dates and the forecast values out of each <location> element.
        for elem in wanted:
            date_list = elem.xpath('locationdata/timeunitdata/date/text()')
            tashkif_list = elem.xpath('locationdata/timeunitdata/element/elementvalue/text()')
But on PythonAnywhere I get this output for `doc`:
b'\n\n\nChallenge=355121;\nChallengeId=58551073;\nGenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n\n\n\nfunction test(var1)\n{\n\tvar var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn answer;\n}\n\n\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not all needed JavaScript methods are supported.");\n\n}\nelse\n{\n\tclient.onreadystatechange = function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif (document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\', ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\' , \'text/plain\');\n\tclient.send();\n}\n\n\n\nJavaScript must be enabled in order to view this page.\n\n'
Things I've tried:
What gives? I was under the impression that requests should behave the same on my machine and on theirs.
Answer 0 (score: 3)
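It looks like the server you're trying to scrape has protection in place to make sure there is a real browser (and a real person) behind each request. If you format that response nicely, you'll see that it sets some headers using the `Challenge` and `ChallengeId` values on the page.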
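As a rough illustration of how to tell this challenge page apart from the real XML, the sketch below pulls the two values out of the response body with regular expressions. The function name `extract_challenge` is hypothetical, and the patterns assume the page always looks like the output shown in the question (`Challenge=...;` and `ChallengeId=...;` as plain numeric assignments).

```python
import re


def extract_challenge(page):
    """Return (challenge, challenge_id) as strings, or None if the body
    does not look like the JavaScript challenge page from the question."""
    # Accept either the raw bytes from requests' .content or decoded text.
    text = page.decode("utf-8", "replace") if isinstance(page, bytes) else page
    # "Challenge" must be followed directly by "=", so "ChallengeId=" and
    # the 'X-AA-Challenge' header names do not match the first pattern.
    challenge = re.search(r"\bChallenge\s*=\s*(\d+)\s*;", text)
    challenge_id = re.search(r"\bChallengeId\s*=\s*(\d+)\s*;", text)
    if not (challenge and challenge_id):
        return None
    return challenge.group(1), challenge_id.group(1)


sample = b'\n\n\nChallenge=355121;\nChallengeId=58551073;\nGenericErrorMessageCookies="...";\n'
print(extract_challenge(sample))
print(extract_challenge(b"<weather><location/></weather>"))
```

A `None` result means the body is probably the document you actually asked for, so it can go straight to `html.fromstring` as before.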
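I assume the IPs/servers that PythonAnywhere uses have been added by the server's owner to a list of requests to block (maybe someone really did spam them in the past?).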
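Looking more closely at those same headers, I found this project, which seems to tackle the same problem: https://github.com/niryariv/opentaba-server/
They check for the challenge here: https://github.com/niryariv/opentaba-server/blob/master/lib/mavat_scrape.py#L31 and solve it using this helper: https://github.com/niryariv/opentaba-server/blob/master/lib/helpers.py#L109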
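For reference, here is a minimal Python 3 sketch of what solving the challenge might look like. It is my own port of the `test()` JavaScript function from the response in the question, not the code from that project; the names `solve_challenge` and `challenge_headers` are hypothetical, and it assumes the challenge number has at least three digits.

```python
import math


def solve_challenge(challenge):
    """Port of the page's JavaScript test() function; returns a string."""
    digits = list(str(challenge))
    last_digit = int(digits[-1])        # var_arr.reverse()[0]
    digits.sort()                       # JS sort() sorts the same array in place
    min_digit = int(digits[0])
    subvar1 = 2 * int(digits[2]) + int(digits[1])
    # In the JS, number + string concatenates, so subvar2 is a string.
    subvar2 = str(2 * int(digits[2])) + digits[1]
    my_pow = (int(digits[0]) + 2) ** int(digits[1])
    x = int(challenge) * 3 + subvar1
    # cos(pi * n) is +1 or -1 for an integer n; round() removes float noise.
    y = round(math.cos(math.pi * int(subvar2)))
    answer = x * y - my_pow + (min_digit - last_digit)
    # The final "answer + subvar2" is again a string concatenation in JS.
    return str(answer) + subvar2


def challenge_headers(challenge, challenge_id):
    """Headers the page's script attaches to its POST request."""
    return {
        "X-AA-Challenge-ID": str(challenge_id),
        "X-AA-Challenge-Result": solve_challenge(challenge),
        "X-AA-Challenge": str(challenge),
        "Content-Type": "text/plain",
    }


print(challenge_headers("355121", "58551073"))
```

Judging only from the page's script, you would POST these headers to the same URL, ideally through a `requests.Session()` so that any cookie the server sets is kept, and then request the page again. I haven't verified how the server actually responds, so treat that part as an assumption.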
Hope that helps!