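I want to get an architect's information from this website:
https://www.sia.ch/en/membership/member-directory/m/207778/
In particular, I want to extract the name, address, phone number, and e-mail.
Below is what I have tried, but it does not give me that information.
I would like output similar to the following: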
person = ['Pierluigi A Marca', 'Sihlquai 244', '8005 Zürich', '+41 442734340', 'info@bamarch.ch']
import requests
from bs4 import BeautifulSoup

URL = 'https://www.sia.ch/en/membership/member-directory/m/207778/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# Grab the main content container of the member page
results = soup.find(id='content')
print(results.prettify())

The output is:
<div class="pagewidth clearfix" id="content">
<div class="textheader">
</div>
<ul class="headlineicon clearfix">
<li class="print">
<a href="javascript:print();">
</a>
</li>
<li class="bookmark">
<a class="addthis_button_favorites" href="javascript:;">
<span>
</span>
</a>
</li>
<li class="share">
<li class="mail_widget">
<a class="addthis_button_email">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<li class="googleplus">
<a class="addthis_button_google_plusone_share">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<li class="twitter">
<a class="addthis_button_twitter">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<li class="facebook">
<a class="addthis_button_facebook">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<script type="text/javascript">
var addthis_config = { data_track_clickback: false }
</script>
</li>
</ul>
<div class="clearfix spec-height-theme">
<div class="narrowcolumnLeft">
<ul class="clearfix" id="subNavigation">
<li>
<a href="/en/membership/membership/" onfocus="blurLink(this);">
membership
</a>
<span>
</span>
</li>
<li class="active">
<a href="/en/membership/member-directory/" onfocus="blurLink(this);">
member directory
</a>
<span>
</span>
<ul>
<li>
<a href="/en/membership/member-directory/honorary-members/" onfocus="blurLink(this);">
honorary members
</a>
</li>
<li>
<a href="/en/membership/member-directory/individual-members/" onfocus="blurLink(this);">
individual members
</a>
</li>
<li>
<a href="/en/membership/member-directory/corporate-members/" onfocus="blurLink(this);">
corporate members
</a>
</li>
<li>
<a href="/en/membership/member-directory/student-members/" onfocus="blurLink(this);">
student members
</a>
</li>
<li>
<a href="/en/membership/member-directory/partner/" onfocus="blurLink(this);">
partner
</a>
</li>
</ul>
</li>
</ul>
</div>
<div class="widecolumn">
<!--TYPO3SEARCH_begin-->
<div class="csc-default" id="c303">
<div class="tx-updsiafeuseradmin-pi1">
<div class="tx-updsiafeuseradmin-pi1-singleView">
<div class="secr" data-secr="09d93fcfd5cf0f0b68e11bba96f6312c4023c72d">
</div>
<h1 class="mitgliederprofil">
Individual Member
</h1>
<table>
<tr>
<th colspan="2" valign="top">
Address
</th>
</tr>
<tr>
<td colspan="2" valign="top">
<!-- -->
<!--Dipl. Arch. ETH/SIA<br />-->
Mr
<br/>
Pierluigi A Marca
<br/>
Dipl. Arch. ETH/SIA
<br/>
Sihlquai 244
<br/>
8005 Zürich
<br/>
</td>
</tr>
<tr>
<th colspan="2" valign="top">
Contact
</th>
</tr>
<tr>
<td class="col1" valign="top">
Telephone number
<br/>
E-mail
<br/>
</td>
<td valign="top">
<div class="contact-data" data-contact="ggFeglggKF42DCpZz2iOI3EgcsZxN14vIYlhSGFLtORrpHZtgSiJ8tWDNuNxus03JD60nZu+g1FVPIdMiCp/bZMsSL45/+3xu9MMEZLnhH/Y67evbMdMICVsZaULHgIpA+S50ZdTg3glRtCa9CTX/zfXOfgyDaarW44HMYeW6pTMqImejlSubQXjCiPKzS0jgiZHBGspcnBZW/99X0ORYNaEUvOkjJDmozv9yld9A1x4jdyXAqHoDMMx0IICMsJiWcKADTFWKfI0OHHORhv7kvVW3KtbnX5PJjyilH0=">
needs javascript
</div>
</td>
</tr>
<tr>
<th colspan="2" valign="top">
Details
</th>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Profession
</td>
<td valign="top">
Diploma in Architecture
<br/>
</td>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Area of activity
<br/>
</td>
<td valign="top">
Architecture
<br/>
</td>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Professional group
</td>
<td valign="top">
Architecture
</td>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Section
</td>
<td valign="top">
Zurich
<br/>
</td>
</tr>
<tr>
<td colspan="2" valign="top">
</td>
</tr>
</table>
<!--<div class="tx-updsiafeuseradmin-pi1-singleView-footer lightbox-close-link"><a href="javascript:;">Close</a></div>-->
<div class="tx-updsiafeuseradmin-pi1-singleView-footer" style="display:none;">
<span>
</span>
<a href="javascript:history.back()">
back to results list
</a>
</div>
<script type="text/javascript">
jQuery(document).ready(function() {
if (document.referrer.split( "/" )[2] == "www.sia.ch") {
jQuery(".tx-updsiafeuseradmin-pi1-singleView-footer").show();
}
});
</script>
</div>
</div>
</div>
<!--TYPO3SEARCH_end-->
</div>
</div>
</div>
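Answer 0 (score: 1)
You will have to use Selenium so that JavaScript can render some of the details. After that, a little post-processing is needed. The code below gives you the contact details; note that the name comes with the person's title ('Mr').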
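import re
from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = 'https://www.sia.ch/en/membership/member-directory/m/207778/'
# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('C:/chromedriver_win32/chromedriver.exe'))
driver.get(url)
html = driver.page_source
# Mark every line break (<br>, <br/>, <br />) so the cell text can be split later
html = re.sub(r'<br\s*/?>', '::', html)
# Rows 0 and 2 of column 1 hold the address block and the contact block
df = pd.read_html(StringIO(html))[0].iloc[[0, 2], 1]
contact = []
for x in df.tolist():
    alpha = [a.strip() for a in x.split('::') if a.strip()]
    contact.append(alpha)
contact = contact[0] + contact[1]
driver.quit()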
Output:
print(contact)
['Mr', 'Pierluigi A Marca', 'Dipl. Arch. ETH/SIA', 'Sihlquai 244', '8005 Zürich', '+41 442734340', 'info@bamarch.ch']
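The list above still contains the salutation and the degree line, which the question's desired output leaves out. A minimal sketch of trimming it follows; the helper name to_person is hypothetical, and it assumes every profile yields exactly the seven fields shown here in the same order, which may not hold for other members.

```python
def to_person(contact):
    """Drop the salutation and the degree line from the scraped list.

    Assumption: the list always has the order seen in the output above:
    [salutation, name, degree, street, city, phone, e-mail].
    """
    if len(contact) != 7:
        raise ValueError(f"expected 7 fields, got {len(contact)}")
    # Keep name (index 1) and everything from the street onwards (index 3+)
    return [contact[1]] + contact[3:]


contact = ['Mr', 'Pierluigi A Marca', 'Dipl. Arch. ETH/SIA', 'Sihlquai 244',
           '8005 Zürich', '+41 442734340', 'info@bamarch.ch']
person = to_person(contact)
print(person)
```

Raising on an unexpected length is a deliberate choice: a profile without a degree line or without an e-mail would otherwise shift the fields silently.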
Answer 1 (score: 0)
You can do this without Selenium. I will not provide the decoding code (for legal reasons), but here is a hint on how the page does it:
// init hide contact
jQuery(".contact-data").html(Aes.Ctr.decrypt(
jQuery(".contact-data").data("contact"),
jQuery(".secr").data("secr"), 256));
});
The encrypted data is here: //div[@class='contact-data']/@data-contact
and the AES key is here: //div[@class='secr']/@data-secr
The key is generated anew for every request.
Good luck!
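As a rough sketch of the first step only, the two attributes named above can be read from the page markup without a browser. The class ContactAttrParser and the function extract_encrypted_contact are hypothetical names, the sample markup is a shortened stand-in for the real page with a placeholder instead of the real ciphertext, and the decryption itself is deliberately left out, as in the answer. It uses only the standard library; with BeautifulSoup the same lookup would be soup.find(class_='secr')['data-secr'].

```python
from html.parser import HTMLParser


class ContactAttrParser(HTMLParser):
    """Collect data-contact and data-secr from the member page markup."""

    def __init__(self):
        super().__init__()
        self.data_contact = None  # encrypted contact block
        self.data_secr = None     # key, regenerated for each request

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "contact-data" in classes:
            self.data_contact = attrs.get("data-contact")
        elif "secr" in classes:
            self.data_secr = attrs.get("data-secr")


def extract_encrypted_contact(page_html):
    """Return (data_contact, data_secr); either is None if not found."""
    parser = ContactAttrParser()
    parser.feed(page_html)
    return parser.data_contact, parser.data_secr


# Shortened stand-in for the real page; the ciphertext is a placeholder.
sample = """
<div class="secr" data-secr="09d93fcfd5cf0f0b68e11bba96f6312c4023c72d"></div>
<div class="contact-data" data-contact="ENCRYPTED_PLACEHOLDER">needs javascript</div>
"""
data_contact, data_secr = extract_encrypted_contact(sample)
print(data_contact, data_secr)
```

Because the key changes with every request, both attributes have to come from the same response; in practice page_html would be the text of a single requests.get call.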