Hi, I'm trying to scrape the company links from https://www.unpri.org/directory/, but my code keeps returning None instead of the href. Here is my code. I searched on here but can't seem to find anyone else having the same problem.
Here is my original code:
from splinter import Browser
import bs4 as bs
import os
import time
import csv
url = 'https://www.unpri.org/directory/'
path = os.getcwd() + "/chromedriver"
executable_path = {'executable_path': path}
browser = Browser('chrome', **executable_path)
browser.visit(url)
source = browser.html
soup = bs.BeautifulSoup(source,'lxml')
for url in soup.find_all('div', class_="col-xs-8 col-md-9"):
    print(url.get('href', None))
Answer 0 (score: 0)
The idea is to click "show more" until all the links are displayed, and only then collect the links.
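First, the reason the original code prints None: find_all('div', ...) returns the <div> elements themselves, and a <div> carries no href attribute, so .get('href', None) always falls back to the default. The links live on <a> tags nested inside each div, which is what the parsing code below relies on. A minimal sketch of grabbing the nested anchor instead, reusing the soup from the question:

for div in soup.find_all('div', class_="col-xs-8 col-md-9"):
    link = div.find('a')  # the href is on the nested <a>, not the <div>
    if link is not None:
        print(link.get('href'))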
I wrote this script using Selenium to click all three buttons until all the links are displayed. It then saves the full page HTML to a file named page_source.html. That HTML is parsed with BeautifulSoup, the results are saved to a dict ({org_name: url}), and the dict is dumped to a JSON file named organisations.json.
import json
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import ElementNotVisibleException


def click_button_until_all_displayed(browser, button_id):
    # Keep clicking the 'see all' button; it is hidden once every entry is shown.
    button = browser.find_element_by_id(button_id)
    while True:
        try:
            button.click()
        except ElementNotVisibleException:
            break
        sleep(1.2)  # give the page time to load the next batch of entries
BASE_URL = 'https://www.unpri.org'

driver = webdriver.Chrome()
driver.get('{}/directory'.format(BASE_URL))

# Click each of the three 'see all' buttons until everything is displayed.
for button_name in ('asset', 'invest', 'services'):
    click_button_until_all_displayed(driver, 'see_all_{}'.format(button_name))

# Save the fully expanded page so it can be re-parsed without scraping again.
with open('page_source.html', 'w') as f:
    f.write(driver.page_source)

driver.close()

with open('page_source.html', 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

# The organisation link is the <a> inside each div's <h5>, not the div itself.
orgs = {}
for div in soup.find_all('div', class_="col-xs-8 col-md-9"):
    org_name = div.h5.a.text.strip()
    orgs[org_name] = '{}{}'.format(BASE_URL, div.h5.a['href'])

with open('organisations.json', 'w') as f:
    json.dump(orgs, f, indent=2)
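One caveat if you are on a current Selenium install: the find_element_by_id helper used above was removed in Selenium 4.3 in favour of find_element with a By locator. A sketch of the same function against the newer API; the exception handling is widened too, since W3C-compliant drivers tend to report a hidden button as ElementNotInteractableException rather than ElementNotVisibleException:

from time import sleep

from selenium.common.exceptions import (
    ElementNotInteractableException,
    ElementNotVisibleException,
)
from selenium.webdriver.common.by import By


def click_button_until_all_displayed(browser, button_id):
    # Same loop as above; only the element lookup and the caught exceptions change.
    button = browser.find_element(By.ID, button_id)
    while True:
        try:
            button.click()
        except (ElementNotVisibleException, ElementNotInteractableException):
            break
        sleep(1.2)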
Displaying all of the links takes just under 4 minutes. If you want to save yourself some time, you can use the link to the gist with this source code, page_source.html, and organisations.json.
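As a quick sanity check on the output, the dump can be loaded straight back into a dict:

import json

with open('organisations.json') as f:
    orgs = json.load(f)

print('{} organisations scraped'.format(len(orgs)))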