How can I extract information from a website using web scraping in Python?

Time: 2021-05-01 18:59:52

Tags: python pandas web-scraping beautifulsoup

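I need to extract information from a website. If you go to this website, there is a list of items on the left. When you click one of the items, a table with a name and a code appears on the right. How can I create a dataframe with Code and Name columns scraped from the site? Some items do not show a name and code table, and those should be skipped.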

Output dataframe columns:

Name   Code

1 Answer:

Answer 0: (score: 1)

You can use this script to get all IDs, names, and codes from the site:

import re
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://taxonomy.nucc.org/"
page_url = "https://taxonomy.nucc.org/Default/GetContentByItemId/"

html_doc = requests.get(url).text

# The item list on the left is embedded in the page as a JavaScript
# variable, so pull the JSON array out of the HTML source.
treenodes = re.search(r"var treenodes = (\[.*\]);", html_doc)
treenodes = json.loads(treenodes.group(1))

all_data = []

for n in treenodes:
    # Each item's details are served as JSON containing an HTML fragment.
    data = requests.get(page_url + str(n["id"])).json()
    soup = BeautifulSoup(data.get("PartialViewHtml", ""), "html.parser")
    # The code sits in the table cell after the "Code" label;
    # items without a code table get None.
    code = soup.select_one('label[for="Code"]')
    code = code.find_next("td").get_text(strip=True) if code else None

    print(n["id"], n["name"], code)

    all_data.append(
        {
            "id": n["id"],
            "name": n["name"],
            "code": code,
        }
    )

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)

Prints:

...
82   1863                            Attendant Care Provider  3747A0650X
83   1864                            Personal Care Attendant  3747P1801X
84   1866                 Advanced Practice Dental Therapist  125K00000X
85   1867                                   Dental Assistant  126800000X
86   1868                                   Dental Hygienist  124Q00000X
87   1869                       Dental Laboratory Technician  126900000X
88   1870                                   Dental Therapist  125J00000X
89   1871                                            Dentist  122300000X
90   1884                                          Denturist  122400000X
91   1885                                    Oral Medicinist  125Q00000X
92   1872                               Dental Public Health  1223D0001X
93   1873                           Dentist Anesthesiologist  1223D0004X
94   1874                                        Endodontics  1223E0200X
95   1875                                   General Practice  1223G0001X
96   1876                   Oral and Maxillofacial Pathology  1223P0106X
97   1877                   Oral and Maxillofacial Radiology  1223X0008X
98   1878                     Oral and Maxillofacial Surgery  1223S0112X
99   1879                                     Orofacial Pain  1223X2210X
100  1880           Orthodontics and Dentofacial Orthopedics  1223X0400X
...

It also saves data.csv (screenshot from LibreOffice):

[image: data.csv opened in LibreOffice]
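
The question asks for only Name and Code columns and for items without a code table to be skipped, while the script above keeps every item and stores None for a missing code. A minimal sketch of that last step follows, assuming the rows have the same id/name/code shape the script collects. The sample rows here are illustrative: two are taken from the printed output above, and the third is a hypothetical item with no code, added only to show the filtering.

```python
import pandas as pd

# Sample rows in the shape collected by the script. The last row is a
# hypothetical item without a code table, included only for illustration.
all_data = [
    {"id": 1871, "name": "Dentist", "code": "122300000X"},
    {"id": 1884, "name": "Denturist", "code": "122400000X"},
    {"id": 9999, "name": "Hypothetical item without a table", "code": None},
]

df = pd.DataFrame(all_data)

# Drop items with no code, keep only the two requested columns,
# and rename them to match the requested output.
result = (
    df.dropna(subset=["code"])
    .loc[:, ["name", "code"]]
    .rename(columns={"name": "Name", "code": "Code"})
    .reset_index(drop=True)
)

print(result)
```

With the real data, the same chain could be applied to df just before calling to_csv, so the saved file contains only the Name and Code columns.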