I am trying to scrape the names from HTML like this:
<h3><a data-bind="'attr': { 'href': PersonURL }, 'text': PersonName"
href="/bios/mbaxter">Michael N. Baxter</a></h3>
My code is below:
import requests
from bs4 import BeautifulSoup

url = "https://www.morganlewis.com/our-people-results?pagenum=1&sortingqs=Last%20name&pagesize=500&currentGroup=36ef4ad43dea406895fa2d41af32fada&filtergroup=Office&loadCategories=true&param_sitecontentcategory=OUR%20PEOPLE&schoolsearchstring=villanova&subCatInfo=Office,36ef4ad43dea406895fa2d41af32fada&subCatText=Office%20%3A%20Philadelphia"
tag = 'h3'
cls = "data-bind"

def name_scrape(url, tag, cls):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # print(soup.prettify())
    find_name = soup.find_all(tag, class_=cls)
    for entry in find_name:
        print(entry)

name_scrape(url, tag, cls)
It seems the name is in the 'data-bind' class. How can I make sure I scrape the name?
Answer 0 (score: 3)
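The content of that site is highly dynamic: the names are filled in by JavaScript after the page loads, so the HTML that requests receives does not contain them. You therefore have two options: either use a browser simulator such as Selenium, or request the URL that returns the JSON data directly. The latter is the better approach here.
This is how you can grab them (the easy way):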
import requests

url = "https://www.morganlewis.com/biosearchnew/execute?pagenum=1&isInternalBioRequest=false&SortingField=Last%20name&currentGroup=36ef4ad43dea406895fa2d41af32fada&loadCategories=true&param_sitecontentcategory=OUR%20PEOPLE&pagesize=500&schoolsearchstring=villanova&personofficeitem_sm=36ef4ad43dea406895fa2d41af32fada"
res = requests.get(url)
# Each entry in 'SearchResults' carries the person's name under 'Title'.
for items in res.json()['SearchResults']:
    print(items['Title'])
Partial output:
Lindsay Ann Barci
Michael N. Baxter
Jeannine T. Bishop
Jeffrey P. Bodle
Sarah E. Bouchard
Brandon J. Brigham
Amanda M. Bruno
Evan W. Busteed
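
A long hand-written query string is easy to mistype, so here is a minimal sketch of the same JSON approach that builds the URL from a dict and separates the parsing step so it can be checked offline. The helper names `build_search_url` and `extract_names` are hypothetical, and the sample payload is hand-written. It only assumes the response shape implied by the code above (a 'SearchResults' list whose entries have a 'Title' key), which the site may change at any time.

```python
import json
from urllib.parse import urlencode, quote

# Endpoint and parameter names are copied from the URL in the answer above.
BASE_URL = "https://www.morganlewis.com/biosearchnew/execute"


def build_search_url(school, office_id, page_size=500):
    """Assemble the query string from a dict instead of writing it by hand."""
    params = {
        "pagenum": 1,
        "isInternalBioRequest": "false",
        "SortingField": "Last name",
        "currentGroup": office_id,
        "loadCategories": "true",
        "param_sitecontentcategory": "OUR PEOPLE",
        "pagesize": page_size,
        "schoolsearchstring": school,
        "personofficeitem_sm": office_id,
    }
    # quote_via=quote encodes spaces as %20 rather than '+', as in the original URL.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def extract_names(payload):
    """Return every 'Title' in payload['SearchResults'], skipping entries without one."""
    return [item["Title"] for item in payload.get("SearchResults", []) if "Title" in item]


# Hypothetical sample that mimics the assumed response shape; not real API output.
sample = json.loads(
    '{"SearchResults": [{"Title": "Lindsay Ann Barci"}, {"Title": "Michael N. Baxter"}, {}]}'
)

url = build_search_url("villanova", "36ef4ad43dea406895fa2d41af32fada")
names = extract_names(sample)
print(url)
print(names)
```

With requests installed, the live call would be `extract_names(requests.get(url).json())`. Keeping the parsing in its own function means a missing 'SearchResults' key yields an empty list instead of a KeyError.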