I am trying to scrape information from this website: https://www.heiminfo.ch/institutionen
The HTML that holds the information I am looking for looks like this:
<article class="institution card pushed" data-name="HOF SPEICHER AG - (di Gallo)" data-institution-type="HIALTER HIEB CVAPPENZELLALTER" data-subscription="SILBER" data-zoom="15" data-track-content="" data-content-target="Huta5R8" data-lng="9.441113" data-group="Kurt di Gallo Holding AG" data-content-piece="Huta5R8" data-content-name="Institution View List" data-lat="47.41353" style="height: 249.95px;" data-ol-has-click-handler="">
<a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">
<div class="img-container">
<img class=" lazyloaded" width="450" src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" data-src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" alt="HOF SPEICHER AG">
</div>
<div class="text-container" style="height: 114.99px;">
<div class="name-and-addition">
<h2 style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">HOF SPEICHER AG </font></font></h2>
<p class="addition" style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">(di Gallo)</font></font></p>
</div>
<p class="location">
<span class="canton"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">AR </font></font></span>
<span class="plz"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">9042 </font></font></span>
<span class="city"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">memory</font></font></span>
</p>
</div>
</a>
</article>
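For reference, the fields in a card like this can be read straight from the markup once the page source is available. The following is a minimal sketch using only the standard library's `html.parser`, assuming every card follows the structure shown above. `CardParser` and `parse_cards` are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class CardParser(HTMLParser):
    """Collect one dict per <article class="institution ..."> card."""

    # Span classes seen in the markup above.
    FIELDS = {"canton", "plz", "city"}

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None   # card currently being filled, if any
        self._field = None  # span class whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag == "article" and "institution" in classes:
            # data-name carries the full name, including the addition.
            self._card = {"name": attrs.get("data-name"), "href": None}
        elif self._card is None:
            return
        elif tag == "a" and self._card["href"] is None:
            # First link inside the card points to the detail page.
            self._card["href"] = attrs.get("href")
        elif tag == "span" and self.FIELDS & classes:
            self._field = (self.FIELDS & classes).pop()
            self._card[self._field] = ""

    def handle_data(self, data):
        # Nested <font> wrappers are ignored; only their text is kept.
        if self._card is not None and self._field:
            self._card[self._field] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            self._card[self._field] = self._card[self._field].strip()
            self._field = None
        elif tag == "article" and self._card is not None:
            self.cards.append(self._card)
            self._card = None


def parse_cards(html):
    """Return a list of dicts with name, href, canton, plz and city."""
    parser = CardParser()
    parser.feed(html)
    return parser.cards
```

Calling `parse_cards(driver.page_source)` would then give one dict per card, and the `href` values are the relative detail-page links.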
I have been able to get the first 500 institution names, cities, postal codes (plz), and locations using this code, provided by Arundeep Chohan:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver as wb
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Selenium 3 style; on Selenium 4.6+ wb.Chrome() with no argument finds the driver itself
driver = wb.Chrome('chromedriver.exe')
driver.maximize_window()
driver.get('https://www.heiminfo.ch/institutionen')
# Click the third button in the search form before reading the list
driver.find_element(By.XPATH, '/html/body/div[1]/main/div/section/form/div[1]/div[3]/div/button[3]').click()
wait = WebDriverWait(driver, 5)
total = 500
h = []
while True:
    try:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        item = soup.find(class_='institutions')
        lsh = item.find_all(class_="name-and-addition")
        # Once enough cards are loaded, collect the names and stop
        if len(lsh) >= total:
            for e in lsh[:total]:
                h.append(e.text.strip())
            data = pd.DataFrame(h, columns=['Adult Homes'])
            print(data)
            break
        # Otherwise load the next batch of cards and give the page time to render
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".next.btn"))).click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
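The remaining piece of information is the phone number, which is hidden behind each card's <a href="..."> link: I have to click the link to open the institution page and read the phone number there. There are 1589 of these links in total.
How do I write a scraper that goes through all of these links and gets the hidden phone numbers? The links look like this: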
[<a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">][1]
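One possible direction, sketched under assumptions: collect the relative `href` values from the cards, turn them into absolute URLs, open each detail page, and look for the phone number there. The post does not show the markup of the detail page, so the sketch below assumes the number is exposed as an `<a href="tel:...">` link, which would need to be checked against the real page. `extract_phone`, `fetch_html`, and `scrape_phones` are hypothetical names, and only the standard library is used.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "https://www.heiminfo.ch"


class TelLinkParser(HTMLParser):
    """Collect the targets of <a href="tel:..."> links (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.lower().startswith("tel:"):
            self.numbers.append(href[4:].strip())


def extract_phone(detail_html):
    """Return the first tel: number on a detail page, or None if absent."""
    parser = TelLinkParser()
    parser.feed(detail_html)
    return parser.numbers[0] if parser.numbers else None


def fetch_html(url):
    """Download a page with a plain HTTP request (no JavaScript is run)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_phones(hrefs, fetch=fetch_html, delay=1.0):
    """Map each absolute detail URL to its phone number (or None)."""
    results = {}
    for href in hrefs:
        # "/institution/hof-speicher-ag/Huta5R8" becomes an absolute URL.
        url = urljoin(BASE_URL, href)
        results[url] = extract_phone(fetch(url))
        time.sleep(delay)  # pause between requests to go easy on the server
    return results
```

If the detail pages only show the number after JavaScript runs, a Selenium-based function that calls `driver.get(url)` and returns `driver.page_source` could be passed as `fetch` instead of `fetch_html`. With 1589 links, keeping a delay between requests seems sensible.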