Question

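I'm having trouble sorting through a wiki table and hoped someone who has done this before could give me advice. From List_of_current_heads_of_state_and_government I need the countries (which works with the code below), and then only the first mention of the head of state plus their name. I'm not sure how to isolate the first mention, since they all come in one cell. My attempt to pull their names gave me this error: IndexError: list index out of range. I would really appreciate your help!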

import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url,'lxml')

my_table = soup.find('table',{'class':'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []
for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]  
    states.append(state_cell.text)
print(states)
for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles)
for row in my_table.find_all('td'):
    name_cell = row.find_all('a')[1]
    names.append(name_cell.text)
print(names)

The ideal output would be a pandas DataFrame:

State | Title | Name |

Answer 1

If I understood your question correctly, the following should get you there:

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"

res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')
# Skip the header row, then read the state cell and the first office-holder cell of each row
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th', 'td'])
    try:
        country = data[0].a.text
        title = data[1].a.text
        name = data[1].a.find_next_sibling().text
    except (IndexError, AttributeError):
        # Row lacks the expected cells or links; skip it rather than print stale values
        continue
    print("{}|{}|{}".format(country, title, name))

Output:

Afghanistan|President|Ashraf Ghani
Albania|President|Ilir Meta
Algeria|President|Abdelaziz Bouteflika
Andorra|Episcopal Co-Prince|Joan Enric Vives Sicília
Angola|President|João Lourenço
Antigua and Barbuda|Queen|Elizabeth II
Argentina|President|Mauricio Macri

and so on.
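Since the question asks for a pandas DataFrame rather than printed lines, here is a minimal sketch of how the same row loop could collect its results into one. It runs on a small inline HTML sample instead of the live page; `SAMPLE_HTML` and `table_to_frame` are hypothetical names, and the markup is only an assumption about what the Wikipedia table looked like when this answer was written.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Assumed, simplified stand-in for the Wikipedia table markup
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
  <tr><th>State</th><th>Head of state</th></tr>
  <tr><th><a href="#">Albania</a></th>
      <td><a href="#">President</a> – <a href="#">Ilir Meta</a></td></tr>
  <tr><th><a href="#">Angola</a></th>
      <td><a href="#">President</a> – <a href="#">João Lourenço</a></td></tr>
</table>
"""

def table_to_frame(html):
    """Collect State, Title and Name from each data row into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    records = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all(["th", "td"])
        if len(cells) < 2 or cells[0].a is None:
            continue  # row does not have the expected shape
        links = cells[1].find_all("a")
        if len(links) < 2:
            continue  # title or name is not a link; skipped in this sketch
        records.append({
            "State": cells[0].a.get_text(strip=True),
            "Title": links[0].get_text(strip=True),
            "Name": links[1].get_text(strip=True),
        })
    return pd.DataFrame(records, columns=["State", "Title", "Name"])

df = table_to_frame(SAMPLE_HTML)
print(df)
```

For the real page, the text returned by requests.get(URL) could be passed to the function in place of the sample, provided the table still has this layout.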

Answer 2

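I appreciate this is an old thread, but in case anyone else wants to do the same thing, I found a very quick and easy way: import the wikipedia Python module, then use pandas' read_html to load the page's tables into a DataFrame. From there you can apply whatever analysis you need.

Here is my code:

import io
import argparse
import pandas as pd
import wikipedia as wp

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--wiki_page", help="Give a wiki page to get table", required=True)
args = parser.parse_args()
html = wp.page(args.wiki_page).html()
# read_html returns one DataFrame per <table>; StringIO lets it read the HTML string as a file-like object
tables = pd.read_html(io.StringIO(html))
try:
    df = tables[1]  # Try the 2nd table first, as on most pages the first table is the contents table
except IndexError:
    df = tables[0]
print(df.to_string())

Simply call it from the command line:

python yourfile.py -p Wikipedia_Page_Article_Here

OR, without command-line arguments, pass the page title to wp.page() directly instead of args.wiki_page.

Hope this helps someone!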
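As a possible alternative to guessing the table by position, read_html accepts a `match` argument that keeps only tables containing a given text. The sketch below shows this on an inline sample; `SAMPLE_HTML` and `find_table` are hypothetical names, the sample markup is an assumption, and read_html needs an HTML parser such as lxml installed.

```python
import io
import pandas as pd

# Assumed sample: a contents-style table followed by the table we want
SAMPLE_HTML = """
<table><tr><th>Contents</th></tr><tr><td>1 Overview</td></tr></table>
<table>
  <tr><th>State</th><th>Head of state</th></tr>
  <tr><td>Albania</td><td>President – Ilir Meta</td></tr>
</table>
"""

def find_table(html, header_text):
    """Return the first table whose text matches header_text."""
    # match filters the returned list down to tables containing the text
    tables = pd.read_html(io.StringIO(html), match=header_text)
    return tables[0]

df = find_table(SAMPLE_HTML, "Head of state")
print(df.to_string())
```

Selecting by content rather than by index should be less sensitive to pages that place extra tables before the one of interest.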

Answer 3

It's not perfect, but it almost works.

import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url,'lxml')

my_table = soup.find('table',{'class':'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []
""" for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]  
    states.append(state_cell.text)
print(states)
for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles) """
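for row in my_table.find_all('td'):
    links = row.find_all('a')
    if len(links) > 1:
        names.append(links[1].text)  # the second link is usually the name
    elif links:
        names.append(links[0].text)  # only one link in the cell, so take it
    # cells without any link are skipped instead of raising IndexError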

print(names)

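So far I can see only one error in the resulting list. The table is a bit difficult because you have to write exceptions for it. For example, some names are not links, and in that case the code only catches the first link it finds in that cell. You would just need to write a few if clauses for such cases. At least, that is how I would do it.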
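The "if clauses" mentioned above might look like the following sketch. It assumes a cell holds either two links (title, then name) or plain text of the form "Title – Name" separated by an en dash; both the helper name `split_title_and_name` and the sample markup are hypothetical, not taken from the actual page.

```python
from bs4 import BeautifulSoup

def split_title_and_name(cell):
    """Return (title, name) for one table cell, or None if it cannot be split."""
    links = cell.find_all("a")
    if len(links) >= 2:
        # Usual case: first link is the title, second link is the name
        return links[0].get_text(strip=True), links[1].get_text(strip=True)
    # Fallback (assumption): the name is not a link, so split the text on the en dash
    text = cell.get_text(" ", strip=True)
    if "–" in text:
        title, name = text.split("–", 1)
        return title.strip(), name.strip()
    return None  # nothing recognisable in this cell

# Assumed sample cells: linked name, unlinked name, and an unusable cell
SAMPLE_HTML = """
<table><tr>
  <td><a href="#">President</a> – <a href="#">Ilir Meta</a></td>
  <td><a href="#">Queen</a> – Elizabeth II</td>
  <td>n/a</td>
</tr></table>
"""

cells = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("td")
results = [split_title_and_name(cell) for cell in cells]
print(results)
```

Returning None for unrecognised cells leaves it to the caller to decide whether to skip them or record a blank.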

Selectively scraping a Wikipedia table with Python
