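I am writing a data-scraping script. The goal is to collect data on available broadband deals from the BT website. I can't figure out why my simple requests code does not fill in the form and move on to the next page.
Please help me work out how to enter data into this form and save the output HTML for scraping.
I have identified the relevant tags in the form I am interested in. I am trying to populate the UPRN field and proceed to the next page.
Link to the website: https://www.dslchecker.bt.com/#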
My Python code:
'''python
import requests
url = "https://www.dslchecker.bt.com/#"
payload = {'UPRN':'10033360983'}
r = requests.post(url, data = payload)
print(r.text)
'''
The form on the website:
'''html
<form method="post" action="adsl/ADSLChecker.UPRNoutput">
<input type="hidden" name="URL">
<input value="a%20service%20provider" type="hidden" name="SP_NAME">
<span class="subheading">UPRN:</span><br>
<input maxlength="13" size="14" name="UPRN" autocomplete="off">
<input value="56" type="hidden" name="VERSION">
<input value="E" type="hidden" name="MS">
<input value="no" type="hidden" name="CAP">
<input value="Y" type="hidden" name="AEA">
<input class="form_button" value="submit" type="submit">
</form>
'''
Please follow this link: https://www.dslchecker.bt.com/# and enter 10033346575 in the UPRN field to see the desired output.
Output when run in a Jupyter notebook:
'''html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0063)http://dslcheckerait.vade.bt.com:61065/adsl/adslchecker.welcome -->
<HTML><HEAD>
<STYLE>
.body {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #004d5f; FONT-SIZE: 11px; FONT-WEIGHT: normal; TEXT-DECORATION: none
}
.bodybold {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #333333; FONT-SIZE: 11px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.errormessage {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #000000; FONT-SIZE: 11px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.formDescription {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #666666; FONT-SIZE: 9px; FONT-WEIGHT: normal; TEXT-DECORATION: none
}
.form_button {BORDER-BOTTOM: #666666 1px solid; BORDER-LEFT: #666666 1px solid; BACKGROUND-COLOR: #6400AA; FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #ffffff; FONT-SIZE: 10px; BORDER-TOP: #666666 1px solid; FONT-WEIGHT: bold; BORDER-RIGHT: #666666 1px solid; TEXT-DECORATION: none
}
.heading {FONT-VARIANT: normal; FONT-FAMILY: Arial, Helvetica, sans-serif; COLOR: #004d5f; FONT-SIZE: 14px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.heading3 {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #333333; FONT-SIZE: 10px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.heading4 {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #91b1b8; FONT-SIZE: 10px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.subheading {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Helvetica, sans-serif; COLOR: color: #333333; FONT-SIZE: 14px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
A:active {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: underline
}
A:hover {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: underline
}
A:link {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
A:visited {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: underline
}
BODY {PADDING-BOTTOM: 0px; BACKGROUND-COLOR: #ffffff; MARGIN: 10px; PADDING-LEFT: 0px; PADDING-RIGHT: 0px; PADDING-TOP: 0px
}
</STYLE>
<TITLE>BT Broadband</TITLE>
<META content="text/html; charset=utf-8" http-equiv=Content-Type><LINK
rel=stylesheet type=text/css
href="adslchecker_font.html">
<META content=text/css http-equiv=Content-Style-Type><META http-equiv="X-UA-Compatible" content="IE=5">
<SCRIPT>
<!--
function setFocus() {
document.forms[0].elements[2].focus();
}
//-->
</SCRIPT>
<META name=GENERATOR content="MSHTML 8.00.7601.18751"></HEAD>
<BODY onload=setFocus()>
<TABLE width=500 align=center>
<TBODY>
<TR>
<TD>
<SCRIPT language=JavaScript> var isNS = (navigator.appName == "Netscape") ? 1 : 0;var EnableRightClick = 0;if(isNS) document.captureEvents(Event.MOUSEDOWN||Event.MOUSEUP);function mischandler(){if(EnableRightClick==1){ return true;}else {return false; }}function mousehandler(e){ if(EnableRightClick==1){ return true; } var myevent = (isNS) ? e : event; var eventbutton = (isNS) ? myevent.which : myevent.button; if((eventbutton==2)||(eventbutton==3)) return false;}function keyhandler(e) {var myevent = (isNS) ? e : window.event;if (myevent.keyCode==96)EnableRightClick = 1;return;}document.oncontextmenu = mischandler;document.onkeypress = keyhandler;document.onmousedown = mousehandlerdocument.onmouseup = mousehandler;</SCRIPT>
<TABLE border=0 cellSpacing=0 cellPadding=0 width="100%"><!-- Start Header -->
<TBODY>
<TR><BR><BR>
<!--<TD height=20 vAlign=top align=left><IMG border=0 alt="BT Wholesale"
src="dsl_images/g_main_logo.gif" width=129
height=20></TD></TR>
<TR>
<TD class=body height=14 vAlign=top align=left><IMG alt=""
src="dsl_images/spacer.gif" width=450 height=14></TD></TR>
<TR>//-->
<TD class=body vAlign=top align=left fontStyle="italic">
<TABLE border=0 cellSpacing=0 cellPadding=0 width=450><!-- Start Page Title -->
<TBODY>
<TR>
<TD height=45 vAlign=top width=600 align=left><FONT
style="FONT-FAMILY: Calibri Light" color=#6400AA size=6.5><B> BT BROADBAND
AVAILABILITY
CHECKER</B></FONT></TD></TR><!-- End Page Title --></TD></TR></TBODY></TABLE></TD></TR></TBODY></TABLE><SPAN
class=body><!--RESPONSE-START-->
<P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Welcome to the Broadband Availability checker. This
will provide a provisional check of your ability to receive reliable
Broadband services.</font></SPAN></P>
<P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Please enter your telephone number.</font></SPAN></P>
<FORM method=post action=adsl/adslchecker.TelephoneNumberOutput><INPUT
type=hidden name=URL> <INPUT value=a%20service%20provider type=hidden
name=SP_NAME> <SPAN class=subheading>TELEPHONE:</SPAN><BR><INPUT
maxLength=14 size=14 name=TelNo> <INPUT value=56 type=hidden name=VERSION>
<INPUT value=E type=hidden name=MS> <INPUT value=no type=hidden name=CAP>
<INPUT value=Y type=hidden name=AEA> <INPUT class=form_button value=submit type=submit> </FORM>
<P><SPAN class=body>Or</SPAN></P>
<P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Please enter your access line id.</font></SPAN></P>
<FORM method=post action=adsl/adslchecker.AccessLineIDOutput><INPUT type=hidden
name=URL> <INPUT value=a%20service%20provider type=hidden name=SP_NAME>
<SPAN class=subheading>ACCESS LINE ID:</SPAN><BR><INPUT maxLength=13
size=14 name=ALID> <INPUT value=56 type=hidden name=VERSION> <INPUT
value=E type=hidden name=MS> <INPUT value=no type=hidden name=CAP> <INPUT
value=Y type=hidden name=AEA> <INPUT class=form_button value=submit type=submit> </FORM>
<P><SPAN class=body>Or</SPAN></P>
<P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Please enter your UPRN.</font></SPAN></P>
<FORM method=post action=adsl/ADSLChecker.UPRNoutput><INPUT type=hidden
name=URL> <INPUT value=a%20service%20provider type=hidden name=SP_NAME>
<SPAN class=subheading>UPRN:</SPAN><BR><INPUT maxLength=13
size=14 name=UPRN> <INPUT value=56 type=hidden name=VERSION> <INPUT
value=E type=hidden name=MS> <INPUT value=no type=hidden name=CAP> <INPUT
value=Y type=hidden name=AEA> <INPUT class=form_button value=submit type=submit> </FORM>
<P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">If you do not have a telephone number or access line
id, please select the</font>
<TABLE>
<TR>
<FORM method=post action=adsl/adslchecker.address>
<INPUT value="" type=hidden name=url>
<INPUT value=a%20service%20provider type=hidden name=SP_NAME>
<INPUT value=56 type=hidden name=VERSION>
<INPUT value=E type=hidden name=MS>
<INPUT value=no type=hidden name=CAP>
<INPUT value=Y type=hidden name=AEA>
<TD><A href=# onclick="document.forms[3].submit()">Address Checker</A></TD>
</FORM>
<FONT>
<TH><P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">or the</font></SPAN></P></TH>
</FONT>
<FORM method=post action=adsl/adslchecker.postcode>
<TD><A href=# onclick="document.forms[4].submit()">Postcode Checker</A></TD>
<INPUT value="" type=hidden name=url>
<INPUT value=a%20service%20provider type=hidden name=SP_NAME>
<INPUT value=56 type=hidden name=VERSION>
<INPUT value=E type=hidden name=MS>
<INPUT value=no type=hidden name=CAP>
<INPUT value=Y type=hidden name=AEA>
</FORM>
<FONT>
<TH><P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">or the</font></SPAN></P></TH>
</FONT>
<FORM method=post action=adsl/adslchecker.bbeuidform>
<TD><A href=# onclick="document.forms[5].submit()">BBEU Checker</A></TD>
<INPUT value="" type=hidden name=url>
<INPUT value=a%20service%20provider type=hidden name=SP_NAME>
<INPUT value=56 type=hidden name=VERSION>
<INPUT value=E type=hidden name=MS>
<INPUT value=no type=hidden name=CAP>
<INPUT value=Y type=hidden name=AEA>
</FORM>
</TR>
</TABLE>
<P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">By submitting a query into this checker you accept <A
href="https://www.btwholesale.com/pages/static/terms-of-use.htm" target="_blank">Terms of Use</A> for this checker.</font>
<!--RESPONSE-END--></SPAN></P></SPAN></TD></TR></TBODY></TABLE></BODY></HTML>
'''
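Answer 0 (score: 1)
1) You are posting to the wrong URL.
From the returned HTML, the "action" of the form you want is "adsl/ADSLChecker.UPRNoutput". It is relative to the site root, so the full URL is https://www.dslchecker.bt.com/adsl/ADSLChecker.UPRNoutput.
2) The form contains hidden fields that you are not submitting: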
<form method="post" action="adsl/ADSLChecker.UPRNoutput">
<input type="hidden" name="URL">
<input value="a%20service%20provider" type="hidden" name="SP_NAME">
<span class="subheading">UPRN:</span><br>
<input maxlength="13" size="14" name="UPRN">
<input value="56" type="hidden" name="VERSION">
<input value="E" type="hidden" name="MS">
<input value="no" type="hidden" name="CAP">
<input value="Y" type="hidden" name="AEA">
<input class="form_button" value="submit" type="submit">
</form>
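Both points can also be handled programmatically instead of copying values by hand. The following is a minimal sketch using only the standard library, assuming the page HTML has already been fetched. `FormCollector` and `build_request` are hypothetical helper names, not part of requests or of the site. It reads every named input of the matching form and resolves the relative action against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormCollector(HTMLParser):
    """Collect every <form> on a page with its action and named inputs."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per form: {"action": str, "fields": dict}
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "fields": {}}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            # Inputs without a name (the submit button here) are not sent.
            if name:
                self._current["fields"][name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def build_request(page_html, page_url, action_suffix, overrides):
    """Return (absolute_url, payload) for the form whose action ends with action_suffix."""
    parser = FormCollector()
    parser.feed(page_html)
    for form in parser.forms:
        if form["action"].lower().endswith(action_suffix.lower()):
            payload = dict(form["fields"])
            payload.update(overrides)  # fill in the visible field(s)
            return urljoin(page_url, form["action"]), payload
    raise ValueError(f"no form with action ending in {action_suffix!r}")


# The UPRN form from the question, used here as sample input.
sample_html = """
<form method="post" action="adsl/ADSLChecker.UPRNoutput">
<input type="hidden" name="URL">
<input value="a%20service%20provider" type="hidden" name="SP_NAME">
<input maxlength="13" size="14" name="UPRN">
<input value="56" type="hidden" name="VERSION">
<input value="E" type="hidden" name="MS">
<input value="no" type="hidden" name="CAP">
<input value="Y" type="hidden" name="AEA">
<input class="form_button" value="submit" type="submit">
</form>
"""

url, payload = build_request(
    sample_html,
    "https://www.dslchecker.bt.com/#",  # the URL used in the question
    "UPRNoutput",
    {"UPRN": "10033360983"},
)
print(url)
print(payload)
```

The resulting `url` and `payload` can be passed to `requests.post(url, data=payload)`. Note that the `#` fragment of the page URL plays no part in the resolved address.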
Try:
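import requests

# Send the form's hidden fields together with the UPRN.
payload = {
    "UPRN": "10033360983",
    "SP_NAME": "a%20service%20provider",
    "VERSION": "56",
    "MS": "E",
    "CAP": "no",
    "AEA": "Y",
}
# Post to the form's action URL, not to the landing page.
url = "https://www.dslchecker.bt.com/adsl/ADSLChecker.UPRNoutput"
r = requests.post(url, data=payload)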
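Since the question also asks how to save the output HTML, here is a small sketch of one way to do that and to notice when the server has returned the start page again. It assumes the results page does not contain the welcome sentence seen in the question's output; `is_welcome_page`, `save_html` and the file name are hypothetical.

```python
from pathlib import Path

# Sentence taken from the welcome page shown in the question's output.
WELCOME_MARKER = "Welcome to the Broadband Availability checker"


def is_welcome_page(html):
    """Return True if the HTML looks like the checker's start page."""
    return WELCOME_MARKER in html


def save_html(html, path):
    """Write the response body to disk as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


# Usage with the request above (not executed here):
# r = requests.post(url, data=payload)
# r.raise_for_status()                      # fail loudly on HTTP errors
# if is_welcome_page(r.text):
#     raise RuntimeError("got the start page back; check URL and fields")
# save_html(r.text, "uprn_10033360983.html")
```

Checking for the welcome text first avoids silently saving the same start page for every UPRN.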
Answer 1 (score: 0)
You are posting to the wrong URL. I used pandas to pull out the tables, so you will need to do some cleanup, but try:
import requests
import pandas as pd

url = 'https://www.dslchecker.bt.com/adsl/ADSLChecker.UPRNoutput'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
UPRN = 10033346575
payload = {
    'URL': '',
    'SP_NAME': 'a%20service%20provider',
    'UPRN': str(UPRN),
    'VERSION': '56',
    'MS': 'E',
    'CAP': 'no',
    'AEA': 'Y',
}
# params= puts the fields in the query string; data= would put them in the POST body.
response = requests.post(url, headers=headers, params=payload)
# read_html returns one DataFrame per <table>; keep the last one.
tables = pd.read_html(response.text)
df = tables[-1]
Output:
print(df.to_string())
   | Featured Products | Downstream Line Rate(Mbps) | | Upstream Line Rate(Mbps) | | Downstream Handback Threshold(Mbps) | WBC FTTC Availability Date | WBC SOGEA Availability Date | Unnamed: 8_level_0
   | Unnamed: 0_level_1 | High | Low | High | Low | Unnamed: 5_level_1 | Unnamed: 6_level_1 | Unnamed: 7_level_1 | Unnamed: 8_level_1
0  | VDSL Range A (Clean) | 3 | 1.2 | 1.2 | 0.8 | 0.8 | Available | Available | NaN
1  | VDSL Range B (Impacted) | 2.8 | 1.2 | 1.2 | 0.5 | 0.8 | Available | Available | NaN
2  | Featured Products | Downstream Line Rate(Mbps) | Downstream Line Rate(Mbps) | Upstream Line Rate(Mbps) | Upstream Line Rate(Mbps) | Downstream Range(Mbps) | WBC FTTP Availability Date | NaN | NaN
3  | FTTP on Demand | 330 | 330 | 30 | 30 | -- | Available | -- | NaN
4  | ADSL Products | Downstream Line Rate(Mbps) | Downstream Line Rate(Mbps) | Upstream Line Rate(Mbps) | Upstream Line Rate(Mbps) | Downstream Range(Mbps) | Availability Date | NaN | NaN
5  | WBC ADSL 2+ | Up to 1 | Up to 1 | -- | -- | 1 to 3.5 | Available | -- | NaN
6  | ADSL Max | Up to 1 | Up to 1 | -- | -- | 0.75 to 2.5 | Available | -- | NaN
7  | WBC Fixed Rate | 0.5 | 0.5 | -- | -- | -- | Available | -- | NaN
8  | Fixed Rate | 0.5 | 0.5 | -- | -- | -- | Available | -- | NaN
9  | Observed Speeds | VDSL | VDSL | NaN | NaN | NaN | NaN | NaN | NaN
10 | Other Offerings | NaN | NaN | NaN | NaN | NaN | Availability Date | NaN | NaN
11 | VDSL Multicast | -- | -- | -- | -- | -- | Available | -- | NaN
12 | ADSL Multicast | -- | -- | -- | -- | -- | Available | -- | NaN
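As a starting point for the cleanup mentioned above, here is a hedged sketch that flattens the two-level header and drops columns that are entirely empty, such as the trailing "Unnamed: 8" column. `flatten_columns` and `tidy` are hypothetical helpers, and the small frame below only imitates the first row of the output; the repeated header rows inside the table (rows 2, 4, 9 and 10) would still need separate handling.

```python
import pandas as pd


def flatten_columns(df):
    """Join header levels into one string, skipping 'Unnamed: ...' placeholders."""
    flat = []
    for col in df.columns:
        parts = col if isinstance(col, tuple) else (col,)
        kept = [str(p) for p in parts if not str(p).startswith("Unnamed:")]
        flat.append(" ".join(kept))
    out = df.copy()
    out.columns = flat
    return out


def tidy(df):
    """Flatten the header, then drop columns that hold only missing values."""
    return flatten_columns(df).dropna(axis=1, how="all")


# Tiny made-up frame shaped like part of the first row of the output.
columns = pd.MultiIndex.from_tuples([
    ("Featured Products", "Unnamed: 0_level_1"),
    ("Downstream Line Rate(Mbps)", "High"),
    ("Downstream Line Rate(Mbps)", "Low"),
    ("Unnamed: 8_level_0", "Unnamed: 8_level_1"),
])
raw = pd.DataFrame([["VDSL Range A (Clean)", "3", "1.2", None]], columns=columns)

clean = tidy(raw)
print(clean.to_string())
```

With the real data this would be called as `tidy(df)` on the frame returned by `pd.read_html`.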