I need to scrape the table of top-level domains from iana.org.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.iana.org/domains/root/db'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='tld-table')
How can I get it into a Pandas DataFrame with the same structure as the website (Domain, Type, TLD Manager)?
Answer 0 (score: 2)
Pandas already has a built-in function for reading tables from HTML, so there is no need to use BeautifulSoup:
import pandas as pd
url = "https://www.iana.org/domains/root/db"
# This returns a list of DataFrames with all tables in the page.
df = pd.read_html(url)[0]
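Relying on `[0]` works as long as the TLD table is the first table on the page. As a hedged sketch, `pd.read_html` also accepts an `attrs` argument that selects tables by HTML attribute, so the table can be picked by the `tld-table` id from the question. The helper name `read_tld_table` and the sample HTML below are made up for illustration, and the sketch assumes a parser backend that pandas can use (for example lxml) is installed:

```python
from io import StringIO

import pandas as pd


def read_tld_table(html: str) -> pd.DataFrame:
    """Return the table whose id is 'tld-table' as a DataFrame (hypothetical helper)."""
    # attrs restricts the match to <table> elements carrying this attribute.
    tables = pd.read_html(StringIO(html), attrs={"id": "tld-table"})
    return tables[0]


# Small stand-in for the real page so the sketch runs without network access.
sample_html = """
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="tld-table">
  <thead><tr><th>Domain</th><th>Type</th><th>TLD Manager</th></tr></thead>
  <tbody>
    <tr><td>.aaa</td><td>generic</td><td>American Automobile Association, Inc.</td></tr>
    <tr><td>.aarp</td><td>generic</td><td>AARP</td></tr>
  </tbody>
</table>
"""

df = read_tld_table(sample_html)
print(df)
```

For the live page, the same helper could be fed `requests.get(url).text`, assuming the table still carries that id.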
Answer 1 (score: 1)
You can use pandas' pd.read_html:
import pandas as pd
URL = "https://www.iana.org/domains/root/db"
df = pd.read_html(URL)[0]
print(df.head())
    Domain     Type                            TLD Manager
0     .aaa  generic  American Automobile Association, Inc.
1    .aarp  generic                                   AARP
2  .abarth  generic         Fiat Chrysler Automobiles N.V.
3     .abb  generic                                ABB Ltd
4  .abbott  generic              Abbott Laboratories, Inc.
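If you would rather stay with the BeautifulSoup code from the question, the rows can also be collected by hand and passed to the DataFrame constructor. This is only a rough sketch: the function name `table_to_dataframe` and the sample HTML are invented for illustration, and it assumes the table has one header row of `<th>` cells followed by data rows of `<td>` cells:

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html: str, table_id: str = "tld-table") -> pd.DataFrame:
    """Convert the HTML table with the given id into a DataFrame (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)
    if table is None:
        raise ValueError(f"no element with id={table_id!r}")

    # Header cells become the column names.
    columns = [th.get_text(strip=True) for th in table.find_all("th")]

    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)

    return pd.DataFrame(rows, columns=columns)


# Small stand-in for the real page so the sketch runs without network access.
sample_html = """
<table id="tld-table">
  <thead><tr><th>Domain</th><th>Type</th><th>TLD Manager</th></tr></thead>
  <tbody>
    <tr><td><span><a href="/domains/root/db/aaa.html">.aaa</a></span></td>
        <td>generic</td><td>American Automobile Association, Inc.</td></tr>
    <tr><td><span><a href="/domains/root/db/aarp.html">.aarp</a></span></td>
        <td>generic</td><td>AARP</td></tr>
  </tbody>
</table>
"""

df_bs = table_to_dataframe(sample_html)
print(df_bs)
```

With the question's code, `page.content` would take the place of `sample_html`. For a plain table like this one, pd.read_html remains the shorter option.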