Collecting the return value of a JavaScript function using BeautifulSoup

Date: 2017-12-17 19:26:26

Tags: javascript python web-scraping beautifulsoup python-requests

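I want to use BeautifulSoup to collect agency phone numbers from this URL: https://www.cv-library.co.uk/companies/agencies/0-9

The problem is that each number is only shown after clicking a link that calls a JavaScript function named "contactDetails()". I managed to click all of the links with Selenium, but how do I collect the numbers now?

What should I do next to get past this?

Thanks in advance.

Note: I am new to web scraping.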

import requests
import bs4
from selenium import webdriver

site_url = "https://www.cv-library.co.uk/companies/agencies/0-9"

#---------------------------------- Opening Firefox with Selenium WebDriver ---------------
#browser = webdriver.Firefox() 
#I need my Firefox browser's current profile for a reason.
profile = webdriver.FirefoxProfile(r"C:\Users\USER\AppData\Roaming\Mozilla\Firefox\Profiles\i27jf7iw.default")
browser = webdriver.Firefox(firefox_profile=profile)
browser.get(site_url)

#---------------------------------- Clicking Phone Buttons ---------------------
phone_btn = browser.find_elements_by_link_text("Phone - Click to View")
for i in range(0,20):
    phone_btn[i].click()

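1 Answer:

Answer 0: (score: 0)

Click all the buttons first, then read the numbers afterwards.

After a few test runs, however, I got "Contact details view limit reached" :) so the number of clicks is limited.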

from selenium import webdriver

site_url = "https://www.cv-library.co.uk/companies/agencies/0-9"

# --- Opening Firefox with Selenium WebDriver ---

browser = webdriver.Firefox() 

#I need my Firefox browser's current profile for a reason.
#profile = webdriver.FirefoxProfile(r"C:\Users\USER\AppData\Roaming\Mozilla\Firefox\Profiles\i27jf7iw.default")
#browser = webdriver.Firefox(firefox_profile=profile)

browser.get(site_url)

# --- Clicking Phone Buttons ---

phone_btn = browser.find_elements_by_link_text("Phone - Click to View")
for btn in phone_btn:
    btn.click()

numbers = browser.find_elements_by_class_name('company-profile-phone')
for num in numbers:
    print('number:', num.text)

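Version without Selenium

Every "Phone - Click to View" link carries a number in its onclick attribute (e.g. contactDetails( this, 154513 )). The JavaScript uses that number to fetch the phone number from the server, always through the same URL with the number appended, e.g. https://www.cv-library.co.uk/account-contact-details?id=154513

It worked for a while; I probably hit the click limit :)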
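
To illustrate the onclick format described above, here is a minimal sketch that pulls the number out with a regular expression. The helper name `extract_contact_id` is made up for this example, and the pattern assumes the attribute always looks like `contactDetails( this, <digits> )`, with any amount of whitespace.

```python
import re

# Assumed shape of the attribute: contactDetails( this, 154513 )
# Whitespace around the arguments is allowed to vary.
ONCLICK_PATTERN = re.compile(r"contactDetails\(\s*this\s*,\s*(\d+)\s*\)")


def extract_contact_id(onclick):
    """Return the numeric ID from an onclick value as a string, or None."""
    match = ONCLICK_PATTERN.search(onclick)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_contact_id("contactDetails( this, 154513 )"))
```

The script that follows takes the same number with a fixed slice (`[22:-2]`), which works as long as the attribute keeps exactly that spacing; a pattern like the one above is only a more tolerant alternative.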

import requests
import bs4

site_url = "https://www.cv-library.co.uk/companies/agencies/0-9"
phone_url = "https://www.cv-library.co.uk/account-contact-details?id="

session = requests.Session()
session.headers.update({
    # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    # "Accept-Encoding": "gzip, deflate",
    # "Accept-Language": "pl,en-US;q=0.7,en;q=0.3",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"
})
print(session.headers)

r = session.get(site_url)
print(r.status_code)
soup = bs4.BeautifulSoup(r.text, 'html.parser')

#print(r.text)
all_p = soup.find_all('p', class_='company-profile-phone')

for p in all_p:
    number = p.a['onclick'][22:-2]
    print('Phone ID:', number)

    session.headers.update({
        'X-Requested-With': 'XMLHttpRequest',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
    })

    r = session.get(phone_url + number)

    if r.status_code != 200:
        print("Contact details view limit reached")
    else:
        data = r.json()

        if "email" in data:
            print('email:', data['email'])
        # test for the same key that is read on the next line
        if "telephone" in data:
            print('phone:', data['telephone'])
    print('---')
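
The exact keys in the JSON returned by the contact-details URL are not confirmed here. As a hedged sketch, the response could be read in a way that tolerates either a `phone` or a `telephone` key and does not fail when a field is missing. The function name `read_contact` is hypothetical.

```python
def read_contact(data):
    """Return (email, phone) from a decoded JSON dict.

    Missing fields come back as None instead of raising KeyError.
    """
    email = data.get("email")
    # Assumption: the number may be stored under "telephone" or "phone".
    phone = data.get("telephone") or data.get("phone")
    return email, phone


if __name__ == "__main__":
    email, phone = read_contact({"email": "info@example.com", "telephone": "01234 567890"})
    print("email:", email)
    print("phone:", phone)
```

In the loop above this would be called as `read_contact(r.json())` once the status code has been checked.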