Python: How do I scrape information from a website?

Date: 2020-01-22 10:30:22

Tags: python html web-scraping beautifulsoup

I want to get information about an architect from this website:

https://www.sia.ch/en/membership/member-directory/m/207778/

In particular, I want to extract the name, address, telephone number and e-mail address.

I would like output like the following:

person = ['Pierluigi A Marca', 'Sihlquai 244', '8005 Zürich', '+41 442734340', 'info@bamarch.ch']

This is what I have tried so far, but it does not extract that information:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.sia.ch/en/membership/member-directory/m/207778/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Print the main content block to inspect what the server actually returns.
results = soup.find(id='content')
print(results.prettify())

Output:


<div class="pagewidth clearfix" id="content">
 <div class="textheader">
 </div>
 <ul class="headlineicon clearfix">
  <li class="print">
   <a href="javascript:print();">
   </a>
  </li>
  <li class="bookmark">
   <a class="addthis_button_favorites" href="javascript:;">
    <span>
    </span>
   </a>
  </li>
  <li class="share">
   <li class="mail_widget">
    <a class="addthis_button_email">
     <img alt="" src="/fileadmin/templates/img/transp.gif"/>
    </a>
   </li>
   <li class="googleplus">
    <a class="addthis_button_google_plusone_share">
     <img alt="" src="/fileadmin/templates/img/transp.gif"/>
    </a>
   </li>
   <li class="twitter">
    <a class="addthis_button_twitter">
     <img alt="" src="/fileadmin/templates/img/transp.gif"/>
    </a>
   </li>
   <li class="facebook">
    <a class="addthis_button_facebook">
     <img alt="" src="/fileadmin/templates/img/transp.gif"/>
    </a>
   </li>
   <script type="text/javascript">
    var addthis_config = { data_track_clickback: false }
   </script>
  </li>
 </ul>
 <div class="clearfix spec-height-theme">
  <div class="narrowcolumnLeft">
   <ul class="clearfix" id="subNavigation">
    <li>
     <a href="/en/membership/membership/" onfocus="blurLink(this);">
      membership
     </a>
     <span>
     </span>
    </li>
    <li class="active">
     <a href="/en/membership/member-directory/" onfocus="blurLink(this);">
      member directory
     </a>
     <span>
     </span>
     <ul>
      <li>
       <a href="/en/membership/member-directory/honorary-members/" onfocus="blurLink(this);">
        honorary members
       </a>
      </li>
      <li>
       <a href="/en/membership/member-directory/individual-members/" onfocus="blurLink(this);">
        individual members
       </a>
      </li>
      <li>
       <a href="/en/membership/member-directory/corporate-members/" onfocus="blurLink(this);">
        corporate members
       </a>
      </li>
      <li>
       <a href="/en/membership/member-directory/student-members/" onfocus="blurLink(this);">
        student members
       </a>
      </li>
      <li>
       <a href="/en/membership/member-directory/partner/" onfocus="blurLink(this);">
        partner
       </a>
      </li>
     </ul>
    </li>
   </ul>
  </div>
  <div class="widecolumn">
   <!--TYPO3SEARCH_begin-->
   <div class="csc-default" id="c303">
    <div class="tx-updsiafeuseradmin-pi1">
     <div class="tx-updsiafeuseradmin-pi1-singleView">
      <div class="secr" data-secr="09d93fcfd5cf0f0b68e11bba96f6312c4023c72d">
      </div>
      <h1 class="mitgliederprofil">
       Individual Member
      </h1>
      <table>
       <tr>
        <th colspan="2" valign="top">
         Address
        </th>
       </tr>
       <tr>
        <td colspan="2" valign="top">
         <!-- -->
         <!--Dipl. Arch. ETH/SIA<br />-->
         Mr
         <br/>
         Pierluigi A Marca
         <br/>
         Dipl. Arch. ETH/SIA
         <br/>
         Sihlquai 244
         <br/>
         8005 Zürich
         <br/>
        </td>
       </tr>
       <tr>
        <th colspan="2" valign="top">
         Contact
        </th>
       </tr>
       <tr>
        <td class="col1" valign="top">
         Telephone number
         <br/>
         E-mail
         <br/>
        </td>
        <td valign="top">
         <div class="contact-data" data-contact="ggFeglggKF42DCpZz2iOI3EgcsZxN14vIYlhSGFLtORrpHZtgSiJ8tWDNuNxus03JD60nZu+g1FVPIdMiCp/bZMsSL45/+3xu9MMEZLnhH/Y67evbMdMICVsZaULHgIpA+S50ZdTg3glRtCa9CTX/zfXOfgyDaarW44HMYeW6pTMqImejlSubQXjCiPKzS0jgiZHBGspcnBZW/99X0ORYNaEUvOkjJDmozv9yld9A1x4jdyXAqHoDMMx0IICMsJiWcKADTFWKfI0OHHORhv7kvVW3KtbnX5PJjyilH0=">
          needs javascript
         </div>
        </td>
       </tr>
       <tr>
        <th colspan="2" valign="top">
         Details
        </th>
       </tr>
       <tr>
        <td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
         Profession
        </td>
        <td valign="top">
         Diploma in Architecture
         <br/>
        </td>
       </tr>
       <tr>
        <td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
         Area of activity
         <br/>
        </td>
        <td valign="top">
         Architecture
         <br/>
        </td>
       </tr>
       <tr>
        <td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
         Professional group
        </td>
        <td valign="top">
         Architecture
        </td>
       </tr>
       <tr>
        <td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
         Section
        </td>
        <td valign="top">
         Zurich
         <br/>
        </td>
       </tr>
       <tr>
        <td colspan="2" valign="top">
        </td>
       </tr>
      </table>
      <!--<div class="tx-updsiafeuseradmin-pi1-singleView-footer lightbox-close-link"><a href="javascript:;">Close</a></div>-->
      <div class="tx-updsiafeuseradmin-pi1-singleView-footer" style="display:none;">
       <span>
       </span>
       <a href="javascript:history.back()">
        back to results list
       </a>
      </div>
      <script type="text/javascript">
       jQuery(document).ready(function() {
                if (document.referrer.split( "/" )[2] == "www.sia.ch") {
                    jQuery(".tx-updsiafeuseradmin-pi1-singleView-footer").show();
                }
                });
      </script>
     </div>
    </div>
   </div>
   <!--TYPO3SEARCH_end-->
  </div>
 </div>
</div>
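
Note that the address block (title, name, degree, street, city) is already present in the static HTML above; only the contact cell says "needs javascript". Below is a minimal sketch of reading those address lines with BeautifulSoup, assuming the markup keeps the layout shown above. SAMPLE_HTML is a trimmed copy of that output, and address_lines is a hypothetical helper name, not part of any library.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Trimmed copy of the address rows from the output above.
SAMPLE_HTML = """
<table>
 <tr><th colspan="2" valign="top">Address</th></tr>
 <tr>
  <td colspan="2" valign="top">
   <!-- -->
   <!--Dipl. Arch. ETH/SIA<br />-->
   Mr<br/>
   Pierluigi A Marca<br/>
   Dipl. Arch. ETH/SIA<br/>
   Sihlquai 244<br/>
   8005 Zürich<br/>
  </td>
 </tr>
</table>
"""


def address_lines(html):
    """Return the text lines of the cell that follows the 'Address' header row."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("th", string=lambda s: s and s.strip() == "Address")
    if header is None:
        raise ValueError("no 'Address' header found")
    cell = header.find_parent("tr").find_next_sibling("tr").find("td")

    lines = []
    for node in cell.children:
        # Skip HTML comments and <br/> tags; keep only visible text nodes.
        if isinstance(node, Comment) or not isinstance(node, NavigableString):
            continue
        text = node.strip()
        if text:
            lines.append(text)
    return lines


lines = address_lines(SAMPLE_HTML)
print(lines)
```

On the live page the same function could be given page.content from the request above. The telephone number and e-mail address are not recoverable this way, because they are filled in by JavaScript.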

2 answers:

Answer 0 (score: 1)

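You will have to use Selenium so that the JavaScript can render some of those details. After that, a little manipulation is needed. Note that the name block you get back includes the person's title ('Mr').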

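import re
from io import StringIO

import pandas as pd
from selenium import webdriver

url = 'https://www.sia.ch/en/membership/member-directory/m/207778/'
# Selenium 3 style call; Selenium 4 expects the driver path to be passed via a Service object.
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get(url)

# Turn every <br> variant into a '::' marker so the line breaks survive read_html.
html = re.sub(r'<br\s*/?>', '::', driver.page_source)
driver.quit()

# Column 1 of rows 0 and 2 holds the address block and the decrypted contact block.
df = pd.read_html(StringIO(html))[0].iloc[[0, 2], 1]

contact = []
for cell in df.tolist():
    # Split on the markers and drop empty fragments.
    contact.extend(part.strip() for part in cell.split('::') if part.strip())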

Output:

print (contact)
['Mr', 'Pierluigi A Marca', 'Dipl. Arch. ETH/SIA', 'Sihlquai 244', '8005 Zürich', '+41 442734340', 'info@bamarch.ch']
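
To get from this list to the `person` format requested in the question, the title and the degree line have to be dropped. Below is a minimal sketch, assuming every profile has exactly the seven fields printed above, in the same order. That assumption is not confirmed for other members, and to_person is a hypothetical helper name.

```python
# Output of the Selenium snippet above, copied verbatim.
contact = ['Mr', 'Pierluigi A Marca', 'Dipl. Arch. ETH/SIA', 'Sihlquai 244',
           '8005 Zürich', '+41 442734340', 'info@bamarch.ch']


def to_person(contact):
    """Keep name, street, city, phone and e-mail; drop title and degree.

    Assumes the layout: title, name, degree, street, city, phone, e-mail.
    """
    if len(contact) != 7:
        raise ValueError(f"expected 7 fields, got {len(contact)}")
    _title, name, _degree, street, city, phone, email = contact
    return [name, street, city, phone, email]


person = to_person(contact)
print(person)
```

Raising on an unexpected field count is deliberate: a profile with a missing degree or a second address line would otherwise be silently mislabeled.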

Answer 1 (score: 0)

You can do this without Selenium. I won't provide the decoding code (for legal reasons), but here are some notes on how to do it:

  1. The JavaScript that does the decryption is:
        // init hide contact
        jQuery(".contact-data").html(Aes.Ctr.decrypt(
        jQuery(".contact-data").data("contact"),
        jQuery(".secr").data("secr"), 256));
    });
  2. As you can see, it is AES-CTR. The encoded string is at //div[@class='contact-data']/@data-contact and the AES key is at //div[@class='secr']/@data-secr

The key is generated anew for each request.
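
The first half of these notes, reading the two attributes, can be sketched as follows. This is only an illustration and deliberately stops before any decryption. SAMPLE_HTML mimics the two divs from the question's output, the data-contact value in it is a shortened placeholder rather than real ciphertext, and encrypted_contact is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Mimics the two relevant divs; the data-contact value is a placeholder.
SAMPLE_HTML = """
<div class="tx-updsiafeuseradmin-pi1-singleView">
 <div class="secr" data-secr="09d93fcfd5cf0f0b68e11bba96f6312c4023c72d"></div>
 <div class="contact-data" data-contact="PLACEHOLDER-CIPHERTEXT">
  needs javascript
 </div>
</div>
"""


def encrypted_contact(html):
    """Return (ciphertext, key) read from the same HTML response."""
    soup = BeautifulSoup(html, "html.parser")
    data_div = soup.find("div", class_="contact-data")
    key_div = soup.find("div", class_="secr")
    if data_div is None or key_div is None:
        raise ValueError("contact-data or secr div not found")
    # Equivalent to the XPaths //div[@class='contact-data']/@data-contact
    # and //div[@class='secr']/@data-secr from the notes above.
    return data_div["data-contact"], key_div["data-secr"]


ciphertext, key = encrypted_contact(SAMPLE_HTML)
print(key)
```

Because the key changes with every request, both values have to be taken from one and the same response; a ciphertext from one request cannot be paired with the key from another.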

Good luck!