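I want to collect the agencies' phone numbers from this URL using BeautifulSoup: https://www.cv-library.co.uk/companies/agencies/0-9.
The problem is that each number is only revealed after clicking a link that calls a JavaScript function named "contactDetails()". I managed to click all the links with Selenium, but how do I collect the numbers now?
So what should I do next to get past this?
Thanks in advance.
Note: I am new to web scraping.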
import requests
import bs4
from selenium import webdriver

site_url = "https://www.cv-library.co.uk/companies/agencies/0-9"

#---------------------------------- Opening Firefox with Selenium WebDriver ---------------
#browser = webdriver.Firefox()
#I need my Firefox browser's current profile for a reason.
profile = webdriver.FirefoxProfile(r"C:\Users\USER\AppData\Roaming\Mozilla\Firefox\Profiles\i27jf7iw.default")
browser = webdriver.Firefox(firefox_profile=profile)
browser.get(site_url)

#---------------------------------- Clicking Phone Buttons ---------------------
phone_btn = browser.find_elements_by_link_text("Phone - Click to View")
# Click the first 20 buttons; slicing avoids an IndexError if fewer are found.
for btn in phone_btn[:20]:
    btn.click()
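Answer 0 (score: 0)
Click all the buttons first, then read the numbers afterwards.
After a few test runs, though, I got "Contact details view limit reached" :) so the number of clicks is limited.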
from selenium import webdriver

site_url = "https://www.cv-library.co.uk/companies/agencies/0-9"

# --- Opening Firefox with Selenium WebDriver ---
browser = webdriver.Firefox()
#I need my Firefox browser's current profile for a reason.
#profile = webdriver.FirefoxProfile(r"C:\Users\USER\AppData\Roaming\Mozilla\Firefox\Profiles\i27jf7iw.default")
#browser = webdriver.Firefox(firefox_profile=profile)
browser.get(site_url)

# --- Clicking Phone Buttons ---
# Selenium 3 API; Selenium 4 uses find_elements(By.LINK_TEXT, ...) instead.
phone_btn = browser.find_elements_by_link_text("Phone - Click to View")
for btn in phone_btn:
    btn.click()

# --- Reading the revealed numbers ---
numbers = browser.find_elements_by_class_name('company-profile-phone')
for num in numbers:
    print('number:', num.text)
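Given the view limit mentioned above, it may be worth capping how many buttons get clicked per run and pausing between clicks. This is only a sketch: `click_some`, `max_clicks` and `pause` are names made up for illustration, and I have not checked whether pacing the clicks actually avoids the limit. The helper does not import Selenium; it works on any list of objects that have a `.click()` method, such as the `phone_btn` list above.

```python
import time


def click_some(buttons, max_clicks=5, pause=1.0):
    """Click at most `max_clicks` buttons, sleeping `pause` seconds between clicks.

    `buttons` is any iterable of objects exposing a .click() method.
    Returns the number of buttons that were clicked.
    """
    clicked = 0
    for index, button in enumerate(list(buttons)[:max_clicks]):
        if index:
            # Pause before every click except the first one.
            time.sleep(pause)
        button.click()
        clicked += 1
    return clicked
```

With the Selenium code above this would be called as `click_some(phone_btn, max_clicks=5, pause=2.0)` before reading the `company-profile-phone` elements.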
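Version without Selenium
Every "Phone - Click to View" link carries an ID in its onclick attribute (e.g. contactDetails( this, 154513 )). The page's JavaScript uses that ID to fetch the contact details from the server, always from the same URL with the ID appended, e.g. https://www.cv-library.co.uk/account-contact-details?id=154513.
It worked for a while and then stopped, probably because I reached the view limit :)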
import requests
import bs4

site_url = "https://www.cv-library.co.uk/companies/agencies/0-9"
phone_url = "https://www.cv-library.co.uk/account-contact-details?id="

session = requests.Session()
session.headers.update({
    #"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    #"Accept-Encoding": "gzip, deflate",
    #"Accept-Language": "pl,en-US;q=0.7,en;q=0.3",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"
})
print(session.headers)

# Load the listing page once and parse it.
r = session.get(site_url)
print(r.status_code)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
#print(r.text)

# The contact-details requests are AJAX calls, so send XHR-style headers for them.
session.headers.update({
    'X-Requested-With': 'XMLHttpRequest',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
})

all_p = soup.find_all('p', class_='company-profile-phone')
for p in all_p:
    # onclick looks like "contactDetails( this, 154513 )":
    # drop the 22-character prefix and the trailing " )" to keep only the ID.
    number = p.a['onclick'][22:-2]
    print('Phone ID:', number)

    r = session.get(phone_url + number)
    if r.status_code != 200:
        print("Contact details view limit reached")
    else:
        data = r.json()
        if "email" in data:
            print('email:', data['email'])
        # The original tested for "phone" but read "telephone"; accept either key.
        phone = data.get('telephone', data.get('phone'))
        if phone is not None:
            print('phone:', phone)
    print('---')
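The `[22:-2]` slice above depends on the exact spacing inside the onclick attribute, so a regular expression may be a bit less brittle. The sketch below is one way to do it, assuming the attribute always has the form `contactDetails( this, <digits> )` seen on this page; `contact_id` and `contact_url` are helper names I made up for illustration, not anything the site or a library provides.

```python
import re

# Matches e.g. "contactDetails( this, 154513 )" and tolerates varying whitespace.
CONTACT_ID_RE = re.compile(r"contactDetails\(\s*this\s*,\s*(\d+)\s*\)")
PHONE_URL = "https://www.cv-library.co.uk/account-contact-details?id="


def contact_id(onclick):
    """Return the numeric ID from an onclick string, or None if it does not match."""
    match = CONTACT_ID_RE.search(onclick)
    return match.group(1) if match else None


def contact_url(onclick):
    """Build the contact-details URL for an onclick string, or None if no ID is found."""
    cid = contact_id(onclick)
    return PHONE_URL + cid if cid else None


print(contact_url("contactDetails( this, 154513 )"))
# prints https://www.cv-library.co.uk/account-contact-details?id=154513
```

In the loop above, `number = contact_id(p.a['onclick'])` could replace the slice, skipping the entry when it returns `None`.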