How can I extract the Alpha-3 codes from the first two tables on the following wiki page using Beautiful Soup in Python?
https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language')
soup = bs(r.content, 'lxml')
table = soup.find_all('table', class_='wikitable')[0]
output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

output_rows[1][2].rstrip('\n')
output_rows[2][2].rstrip('\n')
output_rows[3][2].rstrip('\n')
output_rows[4][2].rstrip('\n')
Answer 0 (score: 0)
Use pandas to grab the tables, then append just the first tables (if you need all the data), or pull only the Alpha-3 code column.
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language'
dfs = pd.read_html(url)
df = pd.DataFrame()
for table in dfs[:3]:
    df = df.append(table, sort=True).reset_index(drop=True)
alpha3 = list(df['Alpha-3 code'].dropna())
Output:
print (alpha3)
['AUS', 'NZL', 'GBR', 'USA', 'ATG', 'BHS', 'BRB', 'BLZ', 'BWA', 'BDI', 'CMR', 'CAN', 'COK', 'DMA', 'SWZ', 'FJI', 'GMB', 'GHA', 'GRD', 'GUY', 'IND', 'IRL', 'JAM', 'KEN', 'KIR', 'LSO', 'LBR', 'MWI', 'MLT', 'MHL', 'MUS', 'FSM', 'NAM', 'NGA', 'NIU', 'PAK', 'PLW', 'PNG', 'PHL', 'RWA', 'KNA', 'LCA', 'VCT', 'WSM', 'SYC', 'SLE', 'SGP', 'SLB', 'ZAF', 'SSD', 'SDN', 'TZA', 'TON', 'TTO', 'TUV', 'UGA', 'VUT', 'ZMB', 'ZWE']
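Note that DataFrame.append was removed in pandas 2.0, so the loop above fails on current versions; the same accumulation can be done in one pd.concat call. A minimal sketch, using two tiny stand-in frames in place of the tables returned by pd.read_html:

```python
import pandas as pd

# Stand-ins for the list of tables pd.read_html(url)[:3] would return
dfs = [
    pd.DataFrame({'Country': ['Australia'], 'Alpha-3 code': ['AUS']}),
    pd.DataFrame({'Country': ['Canada'], 'Alpha-3 code': ['CAN']}),
]

# One concat call replaces the df.append loop; sort=True matches the
# original call by sorting the column axis alphabetically.
df = pd.concat(dfs, sort=True).reset_index(drop=True)
alpha3 = list(df['Alpha-3 code'].dropna())
print(alpha3)  # ['AUS', 'CAN']
```

With the real page, the same two lines would follow `dfs = pd.read_html(url)` unchanged.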