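I need to extract information from a website. If you go to the site, there is a list of items on the left, and clicking one of the options shows a table with a name and a code on the right. How can I create a dataframe with Name and Code columns scraped from the site? Some options do not show a name and code table; those should be skipped.
Output dataframe columns: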
Name Code
Answer 0 (score: 1)
You can use this script to get all the IDs, names and codes from the site:
import re
import json

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://taxonomy.nucc.org/"
page_url = "https://taxonomy.nucc.org/Default/GetContentByItemId/"

# The item tree is read from a JavaScript array named `treenodes` in the page source.
html_doc = requests.get(url).text
treenodes = re.search(r"var treenodes = (\[.*\]);", html_doc)
treenodes = json.loads(treenodes.group(1))

all_data = []
for n in treenodes:
    # Each item's detail view comes back as JSON wrapping an HTML fragment.
    data = requests.get(page_url + str(n["id"])).json()
    soup = BeautifulSoup(data.get("PartialViewHtml", ""), "html.parser")

    # The code is in the cell that follows the "Code" label; items without one get None.
    code = soup.select_one('label[for="Code"]')
    code = code.find_next("td").get_text(strip=True) if code else None

    print(n["id"], n["name"], code)

    all_data.append(
        {
            "id": n["id"],
            "name": n["name"],
            "code": code,
        }
    )

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=False)
Prints:
...
82 1863 Attendant Care Provider 3747A0650X
83 1864 Personal Care Attendant 3747P1801X
84 1866 Advanced Practice Dental Therapist 125K00000X
85 1867 Dental Assistant 126800000X
86 1868 Dental Hygienist 124Q00000X
87 1869 Dental Laboratory Technician 126900000X
88 1870 Dental Therapist 125J00000X
89 1871 Dentist 122300000X
90 1884 Denturist 122400000X
91 1885 Oral Medicinist 125Q00000X
92 1872 Dental Public Health 1223D0001X
93 1873 Dentist Anesthesiologist 1223D0004X
94 1874 Endodontics 1223E0200X
95 1875 General Practice 1223G0001X
96 1876 Oral and Maxillofacial Pathology 1223P0106X
97 1877 Oral and Maxillofacial Radiology 1223X0008X
98 1878 Oral and Maxillofacial Surgery 1223S0112X
99 1879 Orofacial Pain 1223X2210X
100 1880 Orthodontics and Dentofacial Orthopedics 1223X0400X
...
and saves data.csv (screenshot from LibreOffice).
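The script above keeps every item, storing None as the code when no "Code" label is found, whereas the question asks for those items to be skipped and for only Name and Code columns. A minimal sketch of that filtering step follows, assuming records shaped like the entries of all_data. The helper name rows_with_code and the sample records are hypothetical and used only for illustration.

```python
def rows_with_code(records):
    """Keep only records that have a code and map them to Name/Code keys."""
    return [
        {"Name": r["name"], "Code": r["code"]}
        for r in records
        if r.get("code")  # skips None and empty strings
    ]


# Hypothetical sample shaped like the all_data list built by the script.
sample = [
    {"id": 1871, "name": "Dentist", "code": "122300000X"},
    {"id": 0, "name": "Hypothetical item without a code table", "code": None},
    {"id": 1884, "name": "Denturist", "code": "122400000X"},
]

rows = rows_with_code(sample)
for row in rows:
    print(row["Name"], row["Code"])
```

Passing the filtered list to pd.DataFrame(rows) instead of pd.DataFrame(all_data) should give a dataframe with just the Name and Code columns.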