How to get the 'href'

Time: 2017-03-17 13:35:50

Tags: python python-3.x web-scraping beautifulsoup python-requests

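I am trying to write a program that fetches my attendance information from my college website. To do that, I wrote a script that logs in to the site, which takes me to my dashboard. It then goes to the Attendance tab, gets the href, and appends it to the college website's URL.
The tag for the Attendance link looks like this: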

<a href="../Student/StudentAttendanceView.aspx?SID=7JyKkZE1Eyx2EYNii7tOjQ==|yaE7DmfR9r8=" id="aAttandance">Attendance</a>

When I click the Attendance link, the browser's address bar shows this URL:

http://erp.college_name.edu/Student/StudentAttendanceView.aspx?SID=7JyKkZE1Eyx2EYNii7tOjQ==|yaE7DmfR9r8=

So I should append the href to 'http://erp.college_name.edu', which is self-explanatory. So I did:

 L = 'http://erp.college_name.edu' + str(I.findAll('li')[4].a.get('href').replace('.', ''))
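Picking the link by position (`findAll('li')[4].a`) may break if the menu order changes, while the tag shown above carries `id="aAttandance"`. With BeautifulSoup, `I.find('a', id='aAttandance')` should find it directly. The sketch below shows the same idea using only the standard library's `html.parser`, which is a swapped-in technique here and not what my script uses. The names `HrefById` and `find_href_by_id` are hypothetical helpers.

```python
from html.parser import HTMLParser


class HrefById(HTMLParser):
    """Remember the href of the first <a> tag whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only keep the first matching anchor.
        if tag == "a" and self.href is None and attrs.get("id") == self.element_id:
            self.href = attrs.get("href")


def find_href_by_id(html, element_id):
    """Return the href of <a id=element_id>, or None if it is not present."""
    parser = HrefById(element_id)
    parser.feed(html)
    return parser.href


# The sample markup is the tag from this question, wrapped in a list item.
sample = (
    '<ul><li><a href="../Student/StudentAttendanceView.aspx'
    '?SID=7JyKkZE1Eyx2EYNii7tOjQ==|yaE7DmfR9r8=" id="aAttandance">'
    "Attendance</a></li></ul>"
)
print(find_href_by_id(sample, "aAttandance"))
```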

But the problem is that the href I fetch is not what is in the tag, and it keeps changing. When I print L, this is what I get:

http://erp.college_name.edu/Student/StudentAttendanceViewaspx?SID=aDmK9cEFWwDqvsWw5ZzEOw==|oTeYVRfW1u8=

The href I get differs from the real URL, and it changes again every time I rerun the program. This is what I got the second time:

http://erp.college_name.edu/Student/StudentAttendanceViewaspx?SID=WM/lbVRchyyBiLsDvkORJw==|MaP8NtvvrHE=

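Why am I getting this? Also, when I click another link on my dashboard page and then click the Attendance tab again, the href value shown in the address bar changes again.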

After that, I did:

opens = requests.get(L)
soup_2 = BeautifulSoup(opens.text, 'lxml')
print(L)  
and I got this:

    C:\Users\HUNTER\AppData\Local\Programs\Python\Python35-32\python.exe 
    C:/Users/HUNTER/PycharmProjects/dictionary/erp_1.py
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
   "http://www.w3.org/TR/html4/strict.dtd">
  <html><head><title>The page cannot be found</title>
   <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <style type="text/css">
    BODY { font: 8pt/12pt verdana }
    H1 { font: 13pt/15pt verdana }
    H2 { font: 8pt/12pt verdana }
   A:link { color: red }
    A:visited { color: maroon }
 </style>
 </head><body><table border="0" cellspacing="10" width="500"><tr><td>
  <h1>The page cannot be found</h1>
  The page you are looking for might have been removed, had its name 
 changed, or is temporarily unavailable.
 <hr/>
 <p>Please try the following:</p>
 <ul>
  <li>Make sure that the Web site address displayed in the address bar of 
your browser is spelled and formatted correctly.</li>
  <li>If you reached this page by clicking a link, contact
    the Web site administrator to alert them that the link is incorrectly 
   formatted.
    </li>
    <li>Click the <a href="javascript:history.back(1)">Back</a> button to 
 try 
   another link.</li>
     </ul>
       <h2>HTTP Error 404 - File or directory not found.<br/>Internet 
    Information 
   Services (IIS)</h2>
<hr/>
 <p>Technical Information (for support personnel)</p>
 <ul>
     <li>Go to <a href="http://go.microsoft.com/fwlink/?
     linkid=8180">Microsoft 
       Product Support Services</a> and perform a title search for the words 
    <b>HTTP</b> and <b>404</b>.</li>
  <li>Open <b>IIS Help</b>, which is accessible in IIS Manager (inetmgr),
  and search for topics titled <b>Web Site Setup</b>, <b>Common 
   Administrative 
  Tasks</b>, and <b>About Custom Error Messages</b>.</li>
   </ul>
    </td></tr></table></body></html>


  Process finished with exit code 0

Update

I replaced the .replace('.', '') call with [2:], because replace also removed the '.' from '.aspx' in the href. The 404 is gone, and the problem has now changed.
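To make the difference concrete, here is a small comparison of the two string approaches with `urllib.parse.urljoin`, which resolves a relative href against the URL of the page it came from. The dashboard address below is an assumption made up for illustration; the real one is whatever URL the dashboard is served from.

```python
from urllib.parse import urljoin

# The href exactly as it appears in the tag.
href = "../Student/StudentAttendanceView.aspx?SID=7JyKkZE1Eyx2EYNii7tOjQ==|yaE7DmfR9r8="

# Removing every '.' also removes the one in '.aspx', which explains the 404.
broken = "http://erp.college_name.edu" + href.replace(".", "")

# Slicing off the leading '..' keeps '.aspx', but only works for this exact prefix.
sliced = "http://erp.college_name.edu" + href[2:]

# Hypothetical dashboard URL (an assumption); urljoin resolves '../' relative to it.
dashboard_url = "http://erp.college_name.edu/Student/Dashboard.aspx"
joined = urljoin(dashboard_url, href)

print(broken)
print(sliced)
print(joined)
```

Under that assumed dashboard address, `sliced` and `joined` come out identical, but `urljoin` would keep working if the href were written as an absolute path or a full URL.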

But why does the href value keep changing, and how do I fetch that page?

Any help?
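One thing I can try, sketched below under assumptions: the SID looks like a token the site regenerates on every page load, so a fresh href would have to be read and followed right away, and a bare `requests.get(L)` does not send the cookies from my login unless the same `requests.Session` is used throughout. The helper `fetch_linked_page` and its `extract_href` callback are hypothetical names. `session` is anything with a `.get(url)` method that returns an object with `.text`, such as a `requests.Session` that has already logged in.

```python
from urllib.parse import urljoin


def fetch_linked_page(session, page_url, extract_href):
    """Load page_url, read a fresh href from it, and follow it in the same session.

    extract_href is a callable that takes the page HTML and returns the href string.
    Returns the resolved target URL and the response for it.
    """
    page = session.get(page_url)
    href = extract_href(page.text)
    # Resolve '../' style links against the page they were found on.
    target = urljoin(page_url, href)
    # Same session object, so the login cookies are sent with this request too.
    return target, session.get(target)
```

With requests, the call might look like `fetch_linked_page(session, dashboard_url, lambda html: BeautifulSoup(html, 'lxml').find('a', id='aAttandance')['href'])`, where `dashboard_url` stands for the dashboard address, which I have not shown here.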

0 answers:

No answers yet