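I want to extract names, phone numbers, and email addresses from this page.
The code works, but the problem is that some of the advisors have more than one link in their "card", so when I extract the links, the whole sequence gets thrown off. For example: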
Julio (July) Anopol , , , Mobile: 416-678-2916 , mailto:julio.luis.anopol@freedom55financial.com
Henry D. Arauag , Office: 905-276-1177 , Ext. 594 , Mobile: 647-649-7955 , mailto:henry.arauag@freedom55financial.com
Rick Auckbaraullee , Office: 905-276-1177 , Ext. 557 , Mobile: 416-577-2377 , mailto:rick.auckbaraullee@freedom55financial.com
Frank Basile , Office: 905-276-1177 , Ext. 469 , Mobile: 416-797-9316 , mailto:frank.basile@freedom55financial.com
Janis Bellman , Office: 905-276-1177 , Ext. 601 , Mobile: 416-258-0630 , https://www.linkedin.com/in/janisbellman
Sean Beneteau , Office: 905-363-5800 , Ext. 123 , , https://www.facebook.com/MyBellman/
Carmen Briguglio , Office: 905-824-5660 , , , https://twitter.com/BellmanJanis
Qi Jun (Steve) Cai , Office: 905-276-1177 , Ext. 591 , Mobile: 416-949-1069 , mailto:janis.bellman@freedom55financial.com
As you can see, as soon as a name's "card" has an extra link attached, every entry after it is paired with the wrong link.
Here is my code:
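To make the drift concrete, here is a small illustration with made-up data (the card names and links are hypothetical, not taken from the page). It only shows the general effect: when the links of all cards end up in one flat list and are then looked up by card position, a single card with extra links shifts everything after it, whereas picking the link from each card's own list does not.

```python
# Hypothetical data: each tuple is (advisor name, footer links of that card).
cards = [
    ("A", ["mailto:a@example.com"]),
    ("B", ["https://www.linkedin.com/in/b", "https://twitter.com/b", "mailto:b@example.com"]),
    ("C", ["mailto:c@example.com"]),
]

names = [name for name, _ in cards]

# Flattening loses the information about which card each link came from.
flat_links = [link for _, links in cards for link in links]

# Pairing by position: entries after the card with extra links are shifted.
paired_by_index = [(names[i], flat_links[i]) for i in range(len(names))]

# Pairing per card: take the first mailto link from that card's own links.
paired_per_card = [
    (name, next((link for link in links if link.startswith("mailto:")), ""))
    for name, links in cards
]

print(paired_by_index)
print(paired_per_card)
```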
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# example option: add 'incognito' command line arg to options
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
# create new instance of chrome in incognito mode
browser = webdriver.Chrome(executable_path='/Library/Application Support/Google/chromedriver', chrome_options=option)
# go to website
browser.get("https://www.freedom55financial.com/ff/advisor/Ontario/Mississauga")
browser.implicitly_wait(4)
# extract names from parent element
all_names = browser.find_elements_by_xpath('//*[@id="advisor-results"]/article[*]/section/h2')
# extract all phone numbers
all_off_phones_numbers = browser.find_elements_by_xpath('//*[@id="advisor-results"]/article[*]/section/p[4]/a')
#extract all exts
all_exts = browser.find_elements_by_xpath('//*[@id="advisor-results"]/article[*]/section/p[5]')
#extract all cell numbers
all_cell_numbers = browser.find_elements_by_xpath('//*[@id="advisor-results"]/article[*]/section/p[6]/a')
#extract all email addys
all_emails = browser.find_elements_by_xpath('//*[@id="advisor-results"]/article[*]/footer/a[*]')
# print out all info
num_page_items = len(all_names)
for i in range(num_page_items):
    print(all_names[i].text + " , " + all_off_phones_numbers[i].text + " , " + all_exts[i].text + " , " + all_cell_numbers[i].text + " , " + all_emails[i].get_attribute('href'))
    # print(all_names[i].text + " , " + all_off_phones_numbers[i].text + " , " + all_exts[i].text + " , " + all_cell_numbers[i].text + " , " + all_emails[i].text)
browser.close()
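One possible way to keep the fields of each advisor together is to loop over the cards and search inside each card, rather than building five page-wide lists and indexing them in parallel. The sketch below is only that, a sketch: `extract_cards` and `text_or_blank` are hypothetical helper names, the CSS selectors are guesses based on the sample markup in this post, and it assumes that only the email anchor has an `href` starting with `mailto:`. It uses the generic `find_elements(by, value)` call with the plain string `"css selector"` (the value behind `By.CSS_SELECTOR`), so it accepts any object that offers the same two methods as a Selenium driver or element.

```python
def text_or_blank(card, css):
    """Return the text of the first match inside this card, or '' if none."""
    found = card.find_elements("css selector", css)
    return found[0].text if found else ""


def extract_cards(browser):
    """Build one row per advisor card so fields cannot drift between cards."""
    rows = []
    # Selector is an assumption drawn from the sample HTML in this post.
    for card in browser.find_elements("css selector", "#advisor-results article"):
        # Keep only footer anchors whose href starts with "mailto:".
        mail_links = card.find_elements("css selector", "footer a[href^='mailto:']")
        email = mail_links[0].get_attribute("href") if mail_links else ""
        # Office and mobile both use data-card='phone'; their text carries the label.
        phones = [a.text for a in card.find_elements("css selector", "a[data-card='phone']")]
        rows.append({
            "name": text_or_blank(card, "section h2"),
            "phones": phones,
            "email": email,
        })
    return rows
```

With the `browser` object from the script above, this could be called as `extract_cards(browser)` before `browser.close()`. Note that with an implicit wait set, each lookup that matches nothing inside a card waits for the full timeout before returning an empty list, so cards with missing fields make the loop slower.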
Here is a sample of the page's HTML showing how the information is laid out:
<section id="advisor-results" class="advisor-results" role="region" aria-live="polite" >
<article class="advisor-results__advisor-card f55f English Portugese CIM Male Photo_Yes" aria-describedby="f55-security-advisor-legend">
<div class="advisor-image">
<img src="/dsms/wcm/connect/ff/0a7d8558-3a70-4c24-8d10-ff779a87d2e2/Amaral_Marcos_2016.jpg?MOD=AJPERES&CACHEID=0a7d8558-3a70-4c24-8d10-ff779a87d2e2" alt="" /> </div>
<section class="advisor-details">
<h2>Marcus (Marcos) Amaral</h2>
<p class="advisor-credentials">CIM</p>
<p class="advisor-firm">
Freedom 55 Financial
</p>
<p class="advisor-offerings"></p>
<a href="http://maps.google.com/?q=1 City Centre Dr., Mississauga, ON, , L5B 1M2" data-card='map' target="_blank">
<address role="presentation">
<span class="address-line">1 City Centre Dr.</span>
<span class="address-line">Suite 1600</span>
<span class="address-line">Mississauga, ON</span>
<span class="address-line">L5B 1M2</span>
</address>
</a>
<p><a href="tel:905-276-1177" data-card="phone" >Office: 905-276-1177</a></p>
<p>Ext. 485</p>
<p><a href="tel:519-819-3241" data-card="phone" >Mobile: 519-819-3241</a></p>
</section>
<footer>
<a title='' data-card='email' target='' href='mailto:marcos.amaral@freedom55financial.com' ><i class='fa fa-envelope-o' aria-label='Contact this advisor by email'></i></a>
</footer>
</article>
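The markup above suggests that the email anchor can be told apart from other footer links by its `data-card='email'` attribute. As a way to check that idea offline, without a browser, here is a minimal sketch using only Python's standard-library `html.parser`. The `SAMPLE` string is a trimmed copy of the card above plus a second, entirely hypothetical card with an extra link, and `CardParser` is a made-up name. It assumes every card follows the same layout as the sample, which has not been verified against the live page.

```python
from html.parser import HTMLParser

# First card is trimmed from the sample above; the second card is hypothetical.
SAMPLE = """
<section id="advisor-results">
<article>
<section class="advisor-details"><h2>Marcus (Marcos) Amaral</h2></section>
<footer>
<a data-card='email' href='mailto:marcos.amaral@freedom55financial.com'><i class='fa fa-envelope-o'></i></a>
</footer>
</article>
<article>
<section class="advisor-details"><h2>Hypothetical Advisor</h2></section>
<footer>
<a href='https://www.linkedin.com/in/hypothetical'></a>
<a data-card='email' href='mailto:hypothetical@example.com'></a>
</footer>
</article>
</section>
"""


class CardParser(HTMLParser):
    """Group the h2 name and the data-card='email' href by <article>."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self.in_h2 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            # Every <article> starts a new card record.
            self.cards.append({"name": "", "email": ""})
        elif tag == "h2" and self.cards:
            self.in_h2 = True
        elif tag == "a" and self.cards and attrs.get("data-card") == "email":
            # Other anchors (maps, phones, social links) are ignored.
            self.cards[-1]["email"] = attrs.get("href", "")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.cards[-1]["name"] += data


parser = CardParser()
parser.feed(SAMPLE)
for card in parser.cards:
    print(card["name"], ",", card["email"])
```

Because each record is created when an `<article>` opens, an extra link in one card cannot shift the email of the next one.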
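I have tried all sorts of variations (finding by CSS selector, XPath with contains-text, and so on), but to no avail.
How can I get just the email address for each advisor?
Thanks in advance.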