Scraping table data from a Wiki page with Beautiful Soup and Python

Time: 2019-12-06 04:39:41

Tags: python-3.x web-scraping beautifulsoup

How can I use Beautiful Soup in Python to extract the Alpha-3 codes from the first two tables on the following Wiki page?

https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language')
soup = bs(r.content, 'lxml')

# [0] keeps only the first wikitable; the question asks about the first two
table = soup.find_all('table', class_='wikitable')[0]

# Collect the text of every <td> cell, row by row
output_rows = []
for table_row in table.find_all('tr'):
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

# Strip the trailing newline from the third cell of each data row
# (note: rstrip returns a new string, so these results are discarded)
output_rows[1][2].rstrip('\n')
output_rows[2][2].rstrip('\n')
output_rows[3][2].rstrip('\n')
output_rows[4][2].rstrip('\n')
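
For reference, here is a minimal Beautiful Soup-only sketch that applies the same loop to the first two tables, assuming (as the indexing above suggests) that the Alpha-3 code is the third <td> cell of each data row:

from bs4 import BeautifulSoup as bs
import requests

url = 'https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language'
soup = bs(requests.get(url).content, 'lxml')

alpha3 = []
# Limit the scan to the first two wikitables, per the question
for table in soup.find_all('table', class_='wikitable')[:2]:
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        # Header rows hold only <th> cells and yield no <td>s, so this
        # skips them; it also guards against short rows
        if len(cells) > 2:
            # get_text(strip=True) removes the trailing '\n' in one step
            alpha3.append(cells[2].get_text(strip=True))

print(alpha3)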

1 Answer:

Answer 0 (score: 0)

Use pandas to read the tables, then concatenate just the first two country tables (if you need all of the data), or take only the Alpha-3 code column.

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language'
dfs = pd.read_html(url)

# The first three parsed tables include the two country tables of interest.
# DataFrame.append was removed in pandas 2.0, so concatenate instead:
df = pd.concat(dfs[:3], ignore_index=True, sort=True)

# Rows from tables without an 'Alpha-3 code' column come through as NaN,
# so drop them before building the list
alpha3 = list(df['Alpha-3 code'].dropna())

Output:

print (alpha3)
['AUS', 'NZL', 'GBR', 'USA', 'ATG', 'BHS', 'BRB', 'BLZ', 'BWA', 'BDI', 'CMR', 'CAN', 'COK', 'DMA', 'SWZ', 'FJI', 'GMB', 'GHA', 'GRD', 'GUY', 'IND', 'IRL', 'JAM', 'KEN', 'KIR', 'LSO', 'LBR', 'MWI', 'MLT', 'MHL', 'MUS', 'FSM', 'NAM', 'NGA', 'NIU', 'PAK', 'PLW', 'PNG', 'PHL', 'RWA', 'KNA', 'LCA', 'VCT', 'WSM', 'SYC', 'SLE', 'SGP', 'SLB', 'ZAF', 'SSD', 'SDN', 'TZA', 'TON', 'TTO', 'TUV', 'UGA', 'VUT', 'ZMB', 'ZWE']
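
As a hedged refinement of the same idea: read_html accepts a match argument that returns only the tables whose text contains the given string or regex, so, assuming the two country tables are the only ones on the page containing the header text 'Alpha-3 code', you can avoid slicing by position entirely:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language'

# Parse only the tables that contain the text 'Alpha-3 code';
# unrelated tables on the page are never returned
dfs = pd.read_html(url, match='Alpha-3 code')

df = pd.concat(dfs, ignore_index=True, sort=True)
alpha3 = df['Alpha-3 code'].dropna().tolist()
print(alpha3)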