Web scraping nested divs with BeautifulSoup

Date: 2018-11-06 23:34:59

Tags: python web-scraping beautifulsoup

I want to scrape the content of the following website:

https://www.morningstar.com/stocks/xnys/mmm/quote.html

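On that page, I want to click Executive, then click Board of Directors, and then scrape the biography from each director's Profile. Ideally, the end result would be the biographies of each of the 12 board members.

I am trying to do this with BeautifulSoup, but I cannot reach that nested div.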

from bs4 import BeautifulSoup as soup
import re, time
import csv
from selenium import webdriver
def get_directors(_html):
  _names = [i.text for i in soup(_html, 'html.parser').find_all('div', {'class':'name ng-binding'})]
  return _names[_names.index('Compensation for all Key Executives')+1:-1]

_board = {}
d = webdriver.Chrome('/Users/tS0u/Downloads/chromedriver')
d.get('https://www.morningstar.com/stocks/xnys/mmm/quote.html')
time.sleep(5)
_exec = d.find_elements_by_class_name("mds-button")
_exec[8].click()
time.sleep(3)
d.find_element_by_link_text("Board of Directors").click()
time.sleep(3)
full_directors = d.find_elements_by_class_name('person-row')[19:31]
for _name, _link in zip(get_directors(d.page_source), full_directors):
   _link.click()
   time.sleep(3)
   d.find_element_by_link_text("Profile").click()
   time.sleep(3)
   _board[_name] = soup(d.page_source, 'html.parser').find_all('div', {'class':'biography'})[-1].text
   _link.click()
   time.sleep(3)
   print(_board)
   with open('filename.csv', 'w') as f:
      write = csv.writer(f)
      write.writerows([['name', 'biography'], *map(list, _board.items())])

Using Selenium and following @Ajax1234's answer produces the following error:

Traceback (most recent call last):
File "/Users/tS0u/Desktop/morningstar_stackoverflowanswer.py", line 21, in <module>
d.find_element_by_link_text("Profile").click()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: 
Element <a href="#" ng-click="subTab.tabSelect(tabItem, $event, item)" 
data-linkbinding="profile" class="ng-binding" label- 
short="...">Profile</a> is not clickable at point (57, 595). Other 
element would receive the click: <div id="_evidon_banner" 
class="evidon-banner" style="position: fixed; display: flex; align- 
items: center; width: 100%; background: rgb(239, 239, 239); font-size: 
14px; color: rgb(0, 0, 0); z-index: 2147000001; padding: 10px 0px; 
font-family: UniversNextMorningStarW04, Arial, Helvetica, sans-serif; 
border-top: 2px solid rgb(153, 153, 153); bottom: 0px;">...</div>
(Session info: chrome=70.0.3538.77)
(Driver info: chromedriver=2.43.600229 
(3fae4d0cda5334b4f533bede5a4787f7b832d052),platform=Mac OS X 10.12.6 x86_64)

Error when trying to export to CSV:

Traceback (most recent call last):
File "/Users/tS0u/Desktop/morningstar_stackoverflowanswer.py", line 22, in <module>
d.find_element_by_link_text("Profile").click()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 80, in click
self._execute(Command.CLICK_ELEMENT)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webelement.py", line 628, in _execute
return self._parent.execute(command, params)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
self.error_handler.check_response(response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: 
Element <a href="#" ng-click="subTab.tabSelect(tabItem, $event, item)" 
data-linkbinding="profile" class="ng-binding" label- 
short="...">Profile</a> is not clickable at point (57, 595). Other 
element would receive the click: <div id="_evidon_banner" 
class="evidon-banner" style="position: fixed; display: flex; align- 
items: center; width: 100%; background: rgb(239, 239, 239); font-size: 
14px; color: rgb(0, 0, 0); z-index: 2147000001; padding: 10px 0px; 
font-family: UniversNextMorningStarW04, Arial, Helvetica, sans-serif; 
border-top: 2px solid rgb(153, 153, 153); bottom: 0px;">...</div>
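
The traceback says that the click on the Profile link is intercepted by a fixed-position banner, the div with id `_evidon_banner`. One possible workaround, offered as a sketch rather than a confirmed fix, is to remove that element with JavaScript before clicking, or to trigger the click through JavaScript instead. The helper names `dismiss_banner` and `js_click` are hypothetical; `execute_script` is the standard Selenium WebDriver method for running JavaScript in the page.

```python
def dismiss_banner(driver, banner_id="_evidon_banner"):
    """Remove the overlay named in the traceback, if it is present.

    `driver` is any Selenium WebDriver instance. Returns True when an
    element with the given id was found and removed, False otherwise.
    """
    script = (
        "var el = document.getElementById(arguments[0]);"
        "if (el && el.parentNode) { el.parentNode.removeChild(el); return true; }"
        "return false;"
    )
    return bool(driver.execute_script(script, banner_id))


def js_click(driver, element):
    """Click via JavaScript, which skips the hit-testing that raised the error."""
    driver.execute_script("arguments[0].click();", element)
```

In the script above, this would mean calling `dismiss_banner(d)` once after the page has loaded, or replacing `d.find_element_by_link_text("Profile").click()` with `js_click(d, d.find_element_by_link_text("Profile"))`.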

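Either way, I really appreciate you taking the time to look at my problem.

1 answer:

Answer 0 (score: 0)

The site is dynamic, so you will have to use a browser automation tool such as selenium: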

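from bs4 import BeautifulSoup as soup
import time
from selenium import webdriver
# Note: the find_element(s)_by_* calls below are the Selenium 3 API;
# Selenium 4 uses find_element(By.<strategy>, value) instead.

def get_directors(_html):
  # Collect every name on the page, then keep the entries listed after the
  # 'Compensation for all Key Executives' row, dropping the final entry.
  _names = [i.text for i in soup(_html, 'html.parser').find_all('div', {'class':'name ng-binding'})]
  return _names[_names.index('Compensation for all Key Executives')+1:-1]

_board = {}
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.morningstar.com/stocks/xnys/mmm/quote.html')
time.sleep(5)  # wait for the JavaScript-rendered page to load
# The hard-coded index 8 depends on the current page layout.
_exec = d.find_elements_by_class_name("mds-button")
_exec[8].click()
time.sleep(3)
d.find_element_by_link_text("Board of Directors").click()
time.sleep(3)
# The slice [19:31] selects twelve rows and also depends on the page layout.
full_directors = d.find_elements_by_class_name('person-row')[19:31]
for _name, _link in zip(get_directors(d.page_source), full_directors):
   _link.click()
   time.sleep(3)
   d.find_element_by_link_text("Profile").click()
   time.sleep(3)
   # Take the text of the last 'biography' div currently in the page.
   _board[_name] = soup(d.page_source, 'html.parser').find_all('div', {'class':'biography'})[-1].text
   _link.click()
   time.sleep(3)

print(_board)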

Output (shortened to save space):

{'Inge G. Thulin': '\nBiography\n\n                Mr. Thulin is the Chairman of the Board, President and Chief Executive Officer of 3M Company. Mr. Thulin served as President and Chief Executive Officer of 3M Company from ....', 'Sondra L. Barbour': '\nBiography\n\n                Ms. Barbour is Executive Vice President, Information Systems and Global Solutions, Lockheed Martin Corporation, a high technology aerospace and defense company. Since joini....', 'Thomas K. Brown': '\nBiography\n\n                Mr. Brown is the Retired Group Vice President, Global Purchasing, Ford Motor Company, a global automotive industry leader. Mr. Brown served in various leadership capacities....', 'David B. Dillon': '\nBiography\n\n                —\n            \n....', 'Michael L Eskew': '\nBiography\n\n                Mr. Eskew is the Retired Chairman of the Board and Chief Executive Officer, United Parcel Service, Inc., a provider of specialized transportation and logistics services. Mr....', 'Herbert L. Henkel': '\nBiography\n\n                Mr. Henkel is the Retired Chairman of the Board and Chief Executive Officer, Ingersoll-Rand plc, a manufacturer of industrial products and components. Mr. Henkel retired as....', 'Amy Hood': "\nBiography\n\n                On August 13, 2017, the Board of Directors of 3M Company elected Amy E. Hood to the Company's Board of Directors, effective August 13, 2017. At Microsoft, Hood is responsib....", 'Muhtar Kent': "\nBiography\n\n                Mr. Kent is the Chairman of the Board and Chief Executive Officer, The Coca-Cola Company, the world's largest beverage company. Mr. Kent has held the position of Chairman o....", 'Edward M. Liddy': '\nBiography\n\n                Mr. Liddy is the Retired Chairman of the Board and Chief Executive Officer, The Allstate Corporation, and former Partner at Clayton, Dubilier & Rice, LLC, a private equity ....', 'Dambisa F. Moyo': "\nBiography\n\n                On August 12, 2018, the Board of Directors of 3M Company elected Dambisa F. Moyo to the Company's Board of Directors, effective August 12, 2018. Dr. Moyo is the founder and....", 'Gregory R. Page': "\nBiography\n\n                On February 1, 2016, the Board of Directors of 3M Company elected Gregory R. Page to the Company's Board of Directors, effective February 1, 2016. Page previously was Cargi....", 'Patricia A. Woertz': "\nBiography\n\n                On February 1, 2016, the Board of Directors of 3M Company elected Patricia A. Woertz to the Company's Board of Directors, effective at the close of business on February 2, ...."}
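
Each value in the output above starts with a `Biography` heading and padding whitespace carried over from the page markup. A minimal sketch of a cleanup step, assuming the heading text is always exactly `Biography`; the function name `clean_biography` is hypothetical and not part of the original answer.

```python
def clean_biography(raw):
    """Strip the leading 'Biography' heading and collapse whitespace runs."""
    text = " ".join(raw.split())  # collapse newlines and repeated spaces
    heading = "Biography"
    if text.startswith(heading):
        text = text[len(heading):].lstrip()
    return text


sample = "\nBiography\n\n                Mr. Thulin is the Chairman of the Board\n"
print(clean_biography(sample))  # prints: Mr. Thulin is the Chairman of the Board
```

Applied to the whole dictionary, this could look like `_board = {name: clean_biography(bio) for name, bio in _board.items()}`.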

Edit:

To write the results to a CSV file:

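import csv
# newline='' stops the csv module from writing blank rows on Windows
with open('filename.csv', 'w', newline='') as f:
  write = csv.writer(f)
  # header row first, then one [name, biography] row per director
  write.writerows([['name', 'biography'], *map(list, _board.items())])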

To create a more general solution that handles different URLs (possibly built from the contents of a list):

def scrape_bios(_driver: webdriver.Chrome, _url: str) -> dict:
  _driver.get(_url)
  time.sleep(5)
  _exec = _driver.find_elements_by_class_name("mds-button")
  _exec[8].click()
  time.sleep(3)
  _board = {}
  _driver.find_element_by_link_text("Board of Directors").click()
  time.sleep(3)
  full_directors = _driver.find_elements_by_class_name('person-row')[19:31]
  for _name, _link in zip(get_directors(_driver.page_source), full_directors):
    _link.click()
    time.sleep(3)
    _driver.find_element_by_link_text("Profile").click()
    time.sleep(3)
    _board[_name] = soup(_driver.page_source, 'html.parser').find_all('div', {'class':'biography'})[-1].text
    _link.click()
    time.sleep(3)
  return _board

Now you can loop over a list of URLs:

d = webdriver.Chrome('/path/to/chromedriver')
for url in urls:
  _results = scrape_bios(d, url)
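
The loop above overwrites `_results` on every pass and relies on a `urls` list that the answer does not define. A sketch of one way to keep the results per URL and carry on past a page that fails; `scrape_all` is a hypothetical helper, and the scraping function is passed in as a parameter so the example does not depend on a live browser.

```python
def scrape_all(urls, scrape_fn):
    """Call scrape_fn(url) for each URL and keep results and failures apart.

    Returns (results, errors): results maps url -> scraped dict,
    errors maps url -> the exception message for pages that failed.
    """
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = scrape_fn(url)
        except Exception as exc:  # e.g. a WebDriverException from a blocked click
            errors[url] = str(exc)
    return results, errors
```

With the answer's function this could be called as `results, errors = scrape_all(urls, lambda u: scrape_bios(d, u))`, followed by `d.quit()` to close the browser.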