Scraping a web page from Wikipedia and loading it into pandas

Time: 2020-09-09 23:10:49

Tags: python pandas dataframe beautifulsoup

Wiki page being scraped: https://en.wikipedia.org/wiki/List_of_Test_cricket_triple_centuries

I want the Score, Batsmen, For, Against, and Ground columns. The approach I have taken is to extract each column separately and then combine them into a pandas DataFrame.

I would like some help with:

  • Code I can add above the other for loops to extract the Score column.
  • Loading everything into pandas so I can get all the data into one table (a sketch covering both points follows the code below).

Just starting out on my Python journey, so all help is greatly appreciated!

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

wiki = "https://en.wikipedia.org/wiki/List_of_Test_cricket_triple_centuries"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url, "lxml")

my_table = soup.find("table", {"class":"wikitable sortable"})

score = []  # Need assistance extracting this column #
batsmen = []
team = []  # 'For' column in Wiki #
against = []
ground = []

# Would like to add the code to extract the Score column here #
for row in my_table.find_all("tr")[1:]:
    batsmen_cell = row.find_all("a")[0]
    batsmen.append(batsmen_cell.text)
for row in my_table.find_all("tr")[1:]:
    team_cell = row.find_all("a")[1]
    team.append(team_cell.text)    
for row in my_table.find_all("tr")[1:]:
    against_cell = row.find_all("a")[2]
    against.append(against_cell.text)
for row in my_table.find_all("tr")[1:]:
    ground_cell = row.find_all("a")[5]
    ground.append(ground_cell.text)   
    
data = [batsmen, team, against, ground]

df = pd.DataFrame(data, columns = ["Batsmen", "For", "Against", "Ground"])
print(df)
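
A minimal sketch of one way to handle both points above (not from the original thread). It assumes the Score sits in the first td cell of every data row, which should be verified against the live page, and it builds the DataFrame from a dict so each list becomes a column; note that pd.DataFrame(data, columns=...) as written above treats each inner list as a row, which raises a shape error when the lists hold more than four items.

# Assumption: the Score is the first <td> in each data row.
for row in my_table.find_all("tr")[1:]:
    score.append(row.find_all("td")[0].get_text(strip=True))

# Build the DataFrame from a dict so each list becomes a column.
df = pd.DataFrame({
    "Score": score,
    "Batsmen": batsmen,
    "For": team,
    "Against": against,
    "Ground": ground,
})
print(df)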

1 Answer:

Answer 0 (score: 1)

In this case it is easier to load the page directly into pandas:

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Test_cricket_triple_centuries')
tables[1]

The output is the table you are looking for. Just drop the unnecessary columns using standard pandas methods.
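
For example, a minimal sketch of that column cleanup. It assumes pd.read_html has a parser available (it needs lxml or html5lib installed) and that the header names below match the article; inspect tables[1].columns first, since the live page's headers may differ.

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_Test_cricket_triple_centuries")
df = tables[1]

print(df.columns)  # check the real header names before selecting

# Keep only the columns of interest; these names are assumptions
# and may not match the live article exactly.
df = df[["Score", "Player", "For", "Against", "Ground"]]
print(df.head())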