BeautifulSoup does not find all div tags

Time: 2020-04-03 12:44:05

Tags: python html beautifulsoup

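I have started a private project: web scraping with Python and BeautifulSoup in Visual Studio Code (1.41.0).

I was able to scrape another site whose structure is the same as that of my "problem site". But now BeautifulSoup does not find all the div tags (each page should have 20, and I only find 3). I have searched Stack Overflow but did not find a solution (or evidently did not understand it).

Website: https://www.comparis.ch/gesundheit/arzt/pathologie

The HTML structure I am interested in looks like this: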

(three screenshots of the page's HTML structure, not reproduced here)

I get every <div class="css-15dj4ut"></div> that sits inside a <div class="css-fh99y9 excbu0j0">...</div>, but none of those inside a <div class="css-roynbj excbu0j0"></div>. Do you know why?

Loop over the URLs to visit every page:

import urllib.request
import urllib.error
from bs4 import BeautifulSoup

# basicUrl, urlAddon, endIndex and urls are defined elsewhere in the script
for i in range(0, endIndex):
    try:
        # The first page uses the bare URL; later pages append the page number
        if i == 0:
            urls.append(basicUrl)
        else:
            urls.append(basicUrl + urlAddon + str(i + 1))
        page = urllib.request.urlopen(urls[i])
        soup = BeautifulSoup(page, 'html.parser')

        getSurgeonName(soup)
    except urllib.error.URLError:
        print("A URL request error occurred.")
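
The pagination logic in the loop above can be separated from the network calls, which makes it easier to check on its own. A minimal sketch, assuming the first page uses the bare URL and page n appends a suffix plus n. `build_page_urls` is a hypothetical helper, and the `?page=` suffix in the usage lines is only an illustrative guess, since the original values of `basicUrl` and `urlAddon` are not shown:

```python
def build_page_urls(basic_url, url_addon, end_index):
    """Return the URLs for pages 1..end_index.

    Page 1 is the bare URL; page n (n >= 2) is basic_url + url_addon + str(n),
    mirroring the loop in the question.
    """
    urls = []
    for i in range(end_index):
        if i == 0:
            urls.append(basic_url)
        else:
            urls.append(basic_url + url_addon + str(i + 1))
    return urls


if __name__ == "__main__":
    # "?page=" is an assumed suffix for illustration, not taken from the site.
    for url in build_page_urls(
        "https://www.comparis.ch/gesundheit/arzt/pathologie", "?page=", 3
    ):
        print(url)
```

Building the list first also makes it possible to inspect the URLs before any request is sent.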

Function, version 1:

def getSurgeonName(soup):
    # Only finds the first 3 surgeons on the page
    docName = re.compile('css-15dj4ut')
    docNameTags = soup.find_all('div', attrs={'class': docName})
    for a in docNameTags:
        docNameList.append(a.getText())

Function, version 2:

def getSurgeonName(soup):

    parentClass = re.compile('css-fh99y9 excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})

    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        docNameList.append(children[0].getText())

    parentClass = re.compile('css-roynbj excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})

    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        docNameList.append(children[0].getText())

1 answer:

Answer 0 (score: 1)

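The data you want is loaded dynamically by JavaScript after the page has loaded, and the requests package (like urllib) does not execute JavaScript, so those elements never appear in the HTML you parse. However, I was able to locate the script tag that holds the data as a JSON string, and then loaded it into a dict.

From there you can parse out whatever you want :).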
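
To illustrate the point without touching the network, here is a minimal sketch on a made-up HTML snippet. The markup, the doctor names, and the shape of the JSON are assumptions modelled on the class names from the question and on the `__NEXT_DATA__` script id; the real page may differ. It shows how the static HTML can contain only some of the result cards while the embedded JSON lists all of them:

```python
import json

from bs4 import BeautifulSoup

# Hypothetical page: one server-rendered card, but two entries in the JSON
# that the page's JavaScript would use to render the remaining cards.
SAMPLE_HTML = """
<html><body>
  <div class="css-fh99y9 excbu0j0"><div class="css-15dj4ut">Dr. A</div></div>
  <script id="__NEXT_DATA__" type="application/json">
    {"props": {"pageProps": {"doctorResults": {"doctorModels": [
      {"name": "Dr. A"}, {"name": "Dr. B"}
    ]}}}}
  </script>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Names present as rendered <div> elements in the static HTML
rendered = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="css-15dj4ut")
]

# Names stored in the embedded JSON; .string returns the raw script body
script = soup.find("script", id="__NEXT_DATA__")
data = json.loads(script.string)
embedded = [
    d["name"]
    for d in data["props"]["pageProps"]["doctorResults"]["doctorModels"]
]

print(len(rendered), "rendered,", len(embedded), "in JSON")
```

Comparing the two counts on a real response is a quick way to check whether missing elements are a parsing problem or simply not in the HTML the server sent.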


Something like this:

import json

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.comparis.ch/gesundheit/arzt/pathologie")
soup = BeautifulSoup(r.content, 'html.parser')
# The page data is embedded as JSON in the <script id="__NEXT_DATA__"> tag
script = soup.find("script", {'id': '__NEXT_DATA__'}).text

data = json.loads(script)

print(data.keys())  # top-level keys of the parsed dict

dumper = json.dumps(data, indent=4)

print(dumper)  # to see it in human-readable format

Then, for example, to print each doctor's name:

for item in data['props']['pageProps']['doctorResults']['doctorModels']:
    print(item['name'])
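
Because the key path above comes from inspecting one response, it may change or be absent on some pages. A small defensive sketch, assuming the same nesting (`props`, `pageProps`, `doctorResults`, `doctorModels`); `extract_doctor_names` is a hypothetical helper, and the sample dict is made up:

```python
def extract_doctor_names(data):
    """Return the 'name' of every entry under the assumed key path.

    Missing keys yield an empty list instead of raising KeyError.
    """
    models = (
        data.get("props", {})
        .get("pageProps", {})
        .get("doctorResults", {})
        .get("doctorModels", [])
    )
    # Skip entries that carry no 'name' field
    return [item["name"] for item in models if "name" in item]


if __name__ == "__main__":
    # Made-up data with the same nesting as the loop above
    sample = {
        "props": {"pageProps": {"doctorResults": {"doctorModels": [
            {"name": "Dr. A"}, {"name": "Dr. B"}, {"id": 3},
        ]}}}
    }
    print(extract_doctor_names(sample))
    print(extract_doctor_names({}))
```

Returning an empty list for a missing path lets the pagination loop continue instead of stopping at the first page that has a different layout.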