How do I scrape the text inside a data-bind element?

Time: 2018-08-08 21:27:25

Tags: python web-scraping beautifulsoup

I am trying to scrape the name from this HTML:

<h3><a data-bind="'attr': { 'href': PersonURL }, 'text': PersonName" 
    href="/bios/mbaxter">Michael N. Baxter</a></h3>

My code is below:

import requests
from bs4 import BeautifulSoup

url = "https://www.morganlewis.com/our-people-results?pagenum=1&sortingqs=Last%20name&pagesize=500&currentGroup=36ef4ad43dea406895fa2d41af32fada&filtergroup=Office&loadCategories=true&param_sitecontentcategory=OUR%20PEOPLE&schoolsearchstring=villanova&subCatInfo=Office,36ef4ad43dea406895fa2d41af32fada&subCatText=Office%20%3A%20Philadelphia"
tag = 'h3'
cls = "data-bind"

def name_scrape(url, tag, cls):
    # Fetch the page and parse the returned HTML
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # print(soup.prettify())
    # Look for <tag> elements whose CSS class equals cls
    find_name = soup.find_all(tag, class_=cls)
    for entry in find_name:
        print(entry)

name_scrape(url, tag, cls)

It seems the name sits inside a 'data-bind' class. How can I make sure I scrape this name?

1 answer:

Answer 0: (score: 3)

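The content of that site is highly dynamic, so the names are most likely filled in by JavaScript after the page loads. You therefore have two options: either use a browser simulator such as selenium, or request the correct URL that returns the JSON data. The latter is by far the better approach.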
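A side note on the lookup in the question: `data-bind` is an HTML attribute, not a class, so `class_="data-bind"` matches nothing even when the markup is present. The sketch below is only an illustration on the static snippet from the question (it does not fetch the live page, where the names may not be in the raw HTML at all), and it assumes BeautifulSoup is installed:

```python
from bs4 import BeautifulSoup

# The snippet from the question, used as a static stand-in for rendered markup.
html = """<h3><a data-bind="'attr': { 'href': PersonURL }, 'text': PersonName"
    href="/bios/mbaxter">Michael N. Baxter</a></h3>"""

soup = BeautifulSoup(html, "html.parser")

# class_="data-bind" searches for class="data-bind", which no tag carries here.
by_class = soup.find_all("h3", class_="data-bind")

# Passing True as the attribute value matches any tag that has the attribute.
links = soup.find_all("a", attrs={"data-bind": True})
names = [a.get_text(strip=True) for a in links]

print(by_class)
print(names)
```

This only helps once the rendered HTML is available, for example from a browser driven by selenium; it is not a substitute for the JSON endpoint.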

Here is how you can grab them (the easy way):

import requests

url = "https://www.morganlewis.com/biosearchnew/execute?pagenum=1&isInternalBioRequest=false&SortingField=Last%20name&currentGroup=36ef4ad43dea406895fa2d41af32fada&loadCategories=true&param_sitecontentcategory=OUR%20PEOPLE&pagesize=500&schoolsearchstring=villanova&personofficeitem_sm=36ef4ad43dea406895fa2d41af32fada"

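# The endpoint returns JSON; each entry under 'SearchResults' is one person
res = requests.get(url)
for item in res.json()['SearchResults']:
    # 'Title' holds the person's full name
    print(item['Title'])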

Partial output:

Lindsay Ann Barci
Michael N. Baxter
Jeannine T. Bishop
Jeffrey P. Bodle
Sarah E. Bouchard
Brandon J. Brigham
Amanda M. Bruno
Evan W. Busteed
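
The long query string above is easier to read and adjust when written as a dictionary. The following is a minimal sketch using only the standard library: the parameter names and values are copied from the URL in the answer, while `build_url`, `extract_names`, and the miniature `sample` payload are hypothetical helpers for illustration. The real response may contain fields not shown here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.morganlewis.com/biosearchnew/execute"

# Query parameters taken from the URL in the answer, in the same order.
PARAMS = {
    "pagenum": 1,
    "isInternalBioRequest": "false",
    "SortingField": "Last name",
    "currentGroup": "36ef4ad43dea406895fa2d41af32fada",
    "loadCategories": "true",
    "param_sitecontentcategory": "OUR PEOPLE",
    "pagesize": 500,
    "schoolsearchstring": "villanova",
    "personofficeitem_sm": "36ef4ad43dea406895fa2d41af32fada",
}


def build_url(base, params):
    """Join base and params; quote_via=quote encodes spaces as %20."""
    return base + "?" + urlencode(params, quote_via=quote)


def extract_names(payload):
    """Return the 'Title' of each entry under 'SearchResults', skipping blanks."""
    return [
        item["Title"]
        for item in payload.get("SearchResults", [])
        if item.get("Title")
    ]


# Hypothetical miniature payload shaped like the keys used in the answer.
sample = {
    "SearchResults": [
        {"Title": "Lindsay Ann Barci"},
        {"Title": "Michael N. Baxter"},
        {},  # an entry without a 'Title' is skipped
    ]
}

print(build_url(BASE_URL, PARAMS))
print(extract_names(sample))
```

With the live endpoint, `extract_names(requests.get(build_url(BASE_URL, PARAMS)).json())` would be the equivalent call, assuming the response still has the shape shown in the answer. `requests.get` also accepts a `params` dictionary, though it encodes the query itself, so the exact string it sends may differ.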