Can this form be filled in and submitted using Python's requests module?

Time: 2019-05-31 10:26:19

Tags: python html web-scraping python-requests jupyter-lab

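I am writing a data scraping script. The goal is to collect data on available broadband deals from the BT website. I cannot figure out why my simple requests code does not fill in the form and proceed to the next page.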

Please help me figure out how to enter data into this form and save the output HTML for scraping.

I have identified the relevant tags in the form of interest. I am trying to populate the UPRN field and proceed to the next page.

Link to the website: https://www.dslchecker.bt.com/#

My Python code: '''python

import requests
url = "https://www.dslchecker.bt.com/#"
payload = {'UPRN':'10033360983'}
r = requests.post(url, data = payload)
print(r.text)

'''

The form on the website:

'''html

<form method="post" action="adsl/ADSLChecker.UPRNoutput"><input type="hidden" name="URL"> <input value="a%20service%20provider" type="hidden" name="SP_NAME">
      <span class="subheading">UPRN:</span><br><input maxlength="13" size="14" name="UPRN" autocomplete="off" style="background-image: url(&quot;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAASCAYAAABSO15qAAAAAXNSR0IArs4c6QAAAPhJREFUOBHlU70KgzAQPlMhEvoQTg6OPoOjT+JWOnRqkUKHgqWP4OQbOPokTk6OTkVULNSLVc62oJmbIdzd95NcuGjX2/3YVI/Ts+t0WLE2ut5xsQ0O+90F6UxFjAI8qNcEGONia08e6MNONYwCS7EQAizLmtGUDEzTBNd1fxsYhjEBnHPQNG3KKTYV34F8ec/zwHEciOMYyrIE3/ehKAqIoggo9inGXKmFXwbyBkmSQJqmUNe15IRhCG3byphitm1/eUzDM4qR0TTNjEixGdAnSi3keS5vSk2UDKqqgizLqB4YzvassiKhGtZ/jDMtLOnHz7TE+yf8BaDZXA509yeBAAAAAElFTkSuQmCC&quot;); background-repeat: no-repeat; background-attachment: scroll; background-size: 16px 18px; background-position: 98% 50%; cursor: auto;"> <input value="56" type="hidden" name="VERSION"> <input value="E" type="hidden" name="MS"> <input value="no" type="hidden" name="CAP"> <input value="Y" type="hidden" name="AEA"> &nbsp; <input class="form_button" value="submit" type="submit"> </form>

'''

Please follow this link: https://www.dslchecker.bt.com/#, then enter 10033346575 in the UPRN field to see the desired output.

Output when run in a Jupyter notebook:

'''html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0063)http://dslcheckerait.vade.bt.com:61065/adsl/adslchecker.welcome -->
<HTML><HEAD>
<STYLE>
.body {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #004d5f; FONT-SIZE: 11px; FONT-WEIGHT: normal; TEXT-DECORATION: none
}
.bodybold {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #333333; FONT-SIZE: 11px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.errormessage {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #000000; FONT-SIZE: 11px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.formDescription {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #666666; FONT-SIZE: 9px; FONT-WEIGHT: normal; TEXT-DECORATION: none
}
.form_button {BORDER-BOTTOM: #666666 1px solid; BORDER-LEFT: #666666 1px solid; BACKGROUND-COLOR: #6400AA; FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #ffffff; FONT-SIZE: 10px; BORDER-TOP: #666666 1px solid; FONT-WEIGHT: bold; BORDER-RIGHT: #666666 1px solid; TEXT-DECORATION: none
}
.heading {FONT-VARIANT: normal; FONT-FAMILY: Arial, Helvetica, sans-serif; COLOR: #004d5f; FONT-SIZE: 14px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.heading3 {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #333333; FONT-SIZE: 10px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.heading4 {FONT-VARIANT: normal; FONT-FAMILY: Verdana, Arial, Helvetica, sans-serif; COLOR: #91b1b8; FONT-SIZE: 10px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
.subheading {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Helvetica, sans-serif; COLOR: color: #333333; FONT-SIZE: 14px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
A:active {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: underline
}
A:hover {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: underline
}
A:link {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: none
}
A:visited {FONT-VARIANT: normal; FONT-FAMILY: Calibri Light, Arial, Helvetica, sans-serif; COLOR: #6400AA; FONT-SIZE: 12px; FONT-WEIGHT: bold; TEXT-DECORATION: underline
}
BODY {PADDING-BOTTOM: 0px; BACKGROUND-COLOR: #ffffff; MARGIN: 10px; PADDING-LEFT: 0px; PADDING-RIGHT: 0px; PADDING-TOP: 0px
}

</STYLE>

<TITLE>BT Broadband</TITLE>
<META content="text/html; charset=utf-8" http-equiv=Content-Type><LINK
rel=stylesheet type=text/css
href="adslchecker_font.html">
<META content=text/css http-equiv=Content-Style-Type><META http-equiv="X-UA-Compatible" content="IE=5">
<SCRIPT>
<!--
function setFocus() {
    document.forms[0].elements[2].focus();
}
//-->
</SCRIPT>

<META name=GENERATOR content="MSHTML 8.00.7601.18751"></HEAD>
<BODY onload=setFocus()>
<TABLE width=500 align=center>
  <TBODY>
  <TR>
    <TD>
      <SCRIPT language=JavaScript>  var isNS = (navigator.appName == "Netscape") ? 1 : 0;var EnableRightClick = 0;if(isNS) document.captureEvents(Event.MOUSEDOWN||Event.MOUSEUP);function mischandler(){if(EnableRightClick==1){ return true;}else {return false; }}function mousehandler(e){  if(EnableRightClick==1){ return true; }  var myevent = (isNS) ? e : event;  var eventbutton = (isNS) ? myevent.which : myevent.button;  if((eventbutton==2)||(eventbutton==3)) return false;}function keyhandler(e) {var myevent = (isNS) ? e : window.event;if (myevent.keyCode==96)EnableRightClick = 1;return;}document.oncontextmenu = mischandler;document.onkeypress = keyhandler;document.onmousedown = mousehandlerdocument.onmouseup = mousehandler;</SCRIPT>

      <TABLE border=0 cellSpacing=0 cellPadding=0 width="100%"><!-- Start Header -->
        <TBODY>
        <TR><BR><BR>
          <!--<TD height=20 vAlign=top align=left><IMG border=0 alt="BT Wholesale"
            src="dsl_images/g_main_logo.gif" width=129
height=20></TD></TR>
        <TR>
          <TD class=body height=14 vAlign=top align=left><IMG alt=""
            src="dsl_images/spacer.gif" width=450 height=14></TD></TR>
        <TR>//-->
          <TD class=body vAlign=top align=left fontStyle="italic">
            <TABLE border=0 cellSpacing=0 cellPadding=0 width=450><!-- Start Page Title -->
              <TBODY>
              <TR>
                <TD height=45 vAlign=top width=600 align=left><FONT
                  style="FONT-FAMILY: Calibri Light" color=#6400AA size=6.5><B> BT BROADBAND
                  AVAILABILITY
              CHECKER</B></FONT></TD></TR><!-- End Page Title --></TD></TR></TBODY></TABLE></TD></TR></TBODY></TABLE><SPAN
      class=body><!--RESPONSE-START-->
      <P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Welcome to the Broadband Availability checker. This
      will provide a provisional check of your ability to receive reliable
      Broadband services.</font></SPAN></P>
      <P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Please enter your telephone number.</font></SPAN></P>
      <FORM method=post action=adsl/adslchecker.TelephoneNumberOutput><INPUT
      type=hidden name=URL> <INPUT value=a%20service%20provider type=hidden
      name=SP_NAME> <SPAN class=subheading>TELEPHONE:</SPAN><BR><INPUT
      maxLength=14 size=14 name=TelNo> <INPUT value=56 type=hidden name=VERSION>
      <INPUT value=E type=hidden name=MS> <INPUT value=no type=hidden name=CAP>
      <INPUT value=Y type=hidden name=AEA> &nbsp; <INPUT class=form_button value=submit type=submit> </FORM>
      <P><SPAN class=body>Or</SPAN></P>
      <P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Please enter your access line id.</font></SPAN></P>
      <FORM method=post action=adsl/adslchecker.AccessLineIDOutput><INPUT type=hidden
      name=URL> <INPUT value=a%20service%20provider type=hidden name=SP_NAME>
      <SPAN class=subheading>ACCESS LINE ID:</SPAN><BR><INPUT maxLength=13
      size=14 name=ALID> <INPUT value=56 type=hidden name=VERSION> <INPUT
      value=E type=hidden name=MS> <INPUT value=no type=hidden name=CAP> <INPUT
      value=Y type=hidden name=AEA> &nbsp; <INPUT class=form_button value=submit type=submit> </FORM>
          <P><SPAN class=body>Or</SPAN></P>
      <P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">Please enter your UPRN.</font></SPAN></P>
      <FORM method=post action=adsl/ADSLChecker.UPRNoutput><INPUT type=hidden
      name=URL> <INPUT value=a%20service%20provider type=hidden name=SP_NAME>
      <SPAN class=subheading>UPRN:</SPAN><BR><INPUT maxLength=13
      size=14 name=UPRN> <INPUT value=56 type=hidden name=VERSION> <INPUT
      value=E type=hidden name=MS> <INPUT value=no type=hidden name=CAP> <INPUT
      value=Y type=hidden name=AEA> &nbsp; <INPUT class=form_button value=submit type=submit> </FORM>
      <P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">If you do not have a telephone number or access line
      id, please select the</font>
<TABLE>
  <TR>
   <FORM method=post action=adsl/adslchecker.address>
          <INPUT value="" type=hidden name=url>
          <INPUT value=a%20service%20provider type=hidden name=SP_NAME>
          <INPUT value=56 type=hidden name=VERSION>
          <INPUT value=E type=hidden name=MS>
          <INPUT value=no type=hidden name=CAP>
          <INPUT value=Y type=hidden name=AEA>
          <TD><A href=# onclick="document.forms[3].submit()">Address Checker</A></TD>
   </FORM>
          <FONT>
          <TH><P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">or the</font></SPAN></P></TH>
          </FONT>
   <FORM method=post action=adsl/adslchecker.postcode>
          <TD><A href=# onclick="document.forms[4].submit()">Postcode Checker</A></TD>
          <INPUT value="" type=hidden name=url>
          <INPUT value=a%20service%20provider type=hidden name=SP_NAME>
          <INPUT value=56 type=hidden name=VERSION>
          <INPUT value=E type=hidden name=MS>
          <INPUT value=no type=hidden name=CAP>
          <INPUT value=Y type=hidden name=AEA>
   </FORM>
          <FONT>
          <TH><P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">or the</font></SPAN></P></TH>
          </FONT>
   <FORM method=post action=adsl/adslchecker.bbeuidform>
          <TD><A href=# onclick="document.forms[5].submit()">BBEU Checker</A></TD>
          <INPUT value="" type=hidden name=url>
          <INPUT value=a%20service%20provider type=hidden name=SP_NAME>
          <INPUT value=56 type=hidden name=VERSION>
          <INPUT value=E type=hidden name=MS>
          <INPUT value=no type=hidden name=CAP>
          <INPUT value=Y type=hidden name=AEA>
   </FORM>
  </TR>
</TABLE>
<P><SPAN class=body><font size="2" font face="Calibri Light" color="#333333">By submitting a query into this checker you accept <A
      href="https://www.btwholesale.com/pages/static/terms-of-use.htm" target="_blank">Terms of Use</A> for this checker.</font>
<!--RESPONSE-END--></SPAN></P></SPAN></TD></TR></TBODY></TABLE></BODY></HTML>

'''

2 Answers:

Answer 0 (score: 1)

So, 1) you are posting to the wrong URL.

In the returned HTML, the "action" of the form you want is "adsl/ADSLChecker.UPRNoutput", which is relative to the site root.
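As a minimal sketch of how that relative action turns into the URL to post to, the standard library's `urljoin` can resolve it against the page address. The variable names here are only illustrative.

```python
from urllib.parse import urljoin

# Page the form was loaded from, and the form's relative "action" attribute
page_url = "https://www.dslchecker.bt.com/#"
form_action = "adsl/ADSLChecker.UPRNoutput"

# urljoin resolves the relative reference against the page URL;
# the "#" fragment of the page URL is not carried over
post_url = urljoin(page_url, form_action)
print(post_url)
# -> https://www.dslchecker.bt.com/adsl/ADSLChecker.UPRNoutput
```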

2) The form has hidden fields that you are not submitting:

<form method="post" action="adsl/ADSLChecker.UPRNoutput">
    <input type="hidden" name="URL"> 
    <input value="a%20service%20provider" type="hidden" name="SP_NAME">
    <span class="subheading">UPRN:</span><br>
    <input maxlength="13" size="14" name="UPRN"> 
    <input value="56" type="hidden" name="VERSION"> 
    <input value="E" type="hidden" name="MS"> 
    <input value="no" type="hidden" name="CAP"> 
    <input value="Y" type="hidden" name="AEA"> &nbsp; 
    <input class="form_button" value="submit" type="submit"> 
</form>

Try:

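import requests

# Send the form's hidden fields along with the UPRN
payload = {
    "UPRN": "10033360983",
    "SP_NAME": "a%20service%20provider",
    "VERSION": "56",
    "MS": "E",
    "CAP": "no",
    "AEA": "Y",
}
# Post to the form's action URL, not to the page URL
url = 'https://www.dslchecker.bt.com/adsl/ADSLChecker.UPRNoutput'
r = requests.post(url, data=payload)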
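Rather than copying the hidden fields by hand, they could also be collected from the form markup. The following is only a sketch using the standard library's `html.parser`. `FormFieldParser` is a hypothetical helper written for this example, and `FORM_HTML` is a trimmed copy of the form shown above, so the snippet runs without network access.

```python
from html.parser import HTMLParser

# Trimmed copy of the UPRN form from the page
FORM_HTML = """
<form method="post" action="adsl/ADSLChecker.UPRNoutput">
    <input type="hidden" name="URL">
    <input value="a%20service%20provider" type="hidden" name="SP_NAME">
    <input maxlength="13" size="14" name="UPRN">
    <input value="56" type="hidden" name="VERSION">
    <input value="E" type="hidden" name="MS">
    <input value="no" type="hidden" name="CAP">
    <input value="Y" type="hidden" name="AEA">
    <input class="form_button" value="submit" type="submit">
</form>
"""


class FormFieldParser(HTMLParser):
    """Collect the action and the hidden inputs of a single-form snippet."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            # A hidden input without a value attribute is submitted as ""
            self.hidden[attrs["name"]] = attrs.get("value") or ""


parser = FormFieldParser()
parser.feed(FORM_HTML)

# Start from the hidden fields, then add the value a user would type in
payload = dict(parser.hidden, UPRN="10033360983")
print(parser.action)
print(payload)
```

On the real page there are several forms, so the parser would need to be limited to the one whose action ends in `UPRNoutput`. This sketch assumes a snippet containing only that form.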

Answer 1 (score: 0)

You are posting to the wrong URL. I used pandas to pull the tables, so you will need to do some cleanup, but try:

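import requests
import pandas as pd

# The form's action URL, not the page URL
url = 'https://www.dslchecker.bt.com/adsl/ADSLChecker.UPRNoutput'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

UPRN = 10033346575

# All form fields, including the hidden ones
payload = {
    'URL': '',
    'SP_NAME': 'a%20service%20provider',
    'UPRN': str(UPRN),
    'VERSION': '56',
    'MS': 'E',
    'CAP': 'no',
    'AEA': 'Y'}

# Note: params= sends the fields in the query string;
# data= would send them in the request body, as the form itself does
response = requests.post(url, headers=headers, params=payload)

# read_html returns every <table> on the page; the results are in the last one
tables = pd.read_html(response.text)
df = tables[-1]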

Output:

print(df.to_string())
          Featured Products  Downstream Line Rate(Mbps)                              Upstream Line Rate(Mbps)                           Downstream Handback Threshold(Mbps)  WBC FTTC Availability Date WBC SOGEA Availability Date Unnamed: 8_level_0
         Unnamed: 0_level_1                        High                         Low                      High                       Low                  Unnamed: 5_level_1          Unnamed: 6_level_1          Unnamed: 7_level_1 Unnamed: 8_level_1
0      VDSL Range A (Clean)                           3                         1.2                       1.2                       0.8                                 0.8                   Available                   Available                NaN
1   VDSL Range B (Impacted)                         2.8                         1.2                       1.2                       0.5                                 0.8                   Available                   Available                NaN
2         Featured Products  Downstream Line Rate(Mbps)  Downstream Line Rate(Mbps)  Upstream Line Rate(Mbps)  Upstream Line Rate(Mbps)              Downstream Range(Mbps)  WBC FTTP Availability Date                         NaN                NaN
3            FTTP on Demand                         330                         330                        30                        30                                  --                   Available                          --                NaN
4             ADSL Products  Downstream Line Rate(Mbps)  Downstream Line Rate(Mbps)  Upstream Line Rate(Mbps)  Upstream Line Rate(Mbps)              Downstream Range(Mbps)           Availability Date                         NaN                NaN
5               WBC ADSL 2+                     Up to 1                     Up to 1                        --                        --                            1 to 3.5                   Available                          --                NaN
6                  ADSL Max                     Up to 1                     Up to 1                        --                        --                         0.75 to 2.5                   Available                          --                NaN
7            WBC Fixed Rate                         0.5                         0.5                        --                        --                                  --                   Available                          --                NaN
8                Fixed Rate                         0.5                         0.5                        --                        --                                  --                   Available                          --                NaN
9           Observed Speeds                        VDSL                        VDSL                       NaN                       NaN                                 NaN                         NaN                         NaN                NaN
10          Other Offerings                         NaN                         NaN                       NaN                       NaN                                 NaN           Availability Date                         NaN                NaN
11           VDSL Multicast                          --                          --                        --                        --                                  --                   Available                          --                NaN
12           ADSL Multicast                          --                          --                        --                        --                                  --                   Available                          --                NaN
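As for the cleanup mentioned above, one possible starting point is to flatten the two-level header and drop columns that are entirely empty. The small frame below is a hypothetical stand-in for `tables[-1]` with the same kind of header, and `flatten_columns` is an illustrative helper, not part of pandas.

```python
import pandas as pd

# Hypothetical two-level header mimicking the shape of the scraped table
columns = pd.MultiIndex.from_tuples([
    ("Featured Products", "Unnamed: 0_level_1"),
    ("Downstream Line Rate(Mbps)", "High"),
    ("Downstream Line Rate(Mbps)", "Low"),
    ("Unnamed: 8_level_0", "Unnamed: 8_level_1"),
])
df = pd.DataFrame(
    [
        ["VDSL Range A (Clean)", "3", "1.2", None],
        ["VDSL Range B (Impacted)", "2.8", "1.2", None],
    ],
    columns=columns,
)


def flatten_columns(frame):
    """Join the header levels, skipping pandas' 'Unnamed: ...' placeholders."""
    flat = []
    for levels in frame.columns:
        parts = [str(p) for p in levels if not str(p).startswith("Unnamed:")]
        flat.append(" ".join(parts))
    out = frame.copy()
    out.columns = flat
    return out


# Flatten the header, then drop columns that contain no data at all
clean = flatten_columns(df).dropna(axis=1, how="all")
print(clean.columns.tolist())
```

The repeated header rows inside the real table (rows 2, 4, 9 and 10 in the output above) would still need separate handling, for example by splitting the frame at those rows.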