I'm trying to get the transfer histories of the top 500 most valuable players on Transfermarkt. I've managed (with some help) to loop through each player's profile and scrape the image and name. Now I want the transfer history, which can be found in a table on each player's profile: Player Profile
I want to save the table in a DataFrame using Pandas and then write it to a CSV with Season, Date, etc. as the headers. For Monaco and PSG, for example, I just want the club names, not the images or nationalities. But right now, all I get is this:
Empty DataFrame
Columns: []
Index: []
Expected output:
Season Date Left Joined MV Fee
0 18/19 Jul 1, 2018 Monaco PSG 120.00m 145.00m
I've looked through the source and inspected the page, but I can't find anything that helps me beyond the tbody and tr. The way I'm doing it, though, I need to target that exact table, since there are several other tables on the page.
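As a side note on targeting one table among several: `pandas.read_html` can filter tables by their text content via its `match` argument, instead of relying on a positional index. A minimal sketch on stand-alone HTML (the snippet below is made up to mirror the expected output, not taken from Transfermarkt):

```python
import pandas as pd
from io import StringIO

# Two tables; `match` keeps only those whose text matches the regex,
# so we can pick out the one with a "Season" header.
html = """
<table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
<table>
  <tr><th>Season</th><th>Date</th><th>Left</th><th>Joined</th><th>MV</th><th>Fee</th></tr>
  <tr><td>18/19</td><td>Jul 1, 2018</td><td>Monaco</td><td>PSG</td><td>120.00m</td><td>145.00m</td></tr>
</table>
"""
tables = pd.read_html(StringIO(html), match="Season")
df = tables[0]
print(df)
```

On the real page, a distinctive header word (or `attrs={"class": ...}` with the table's CSS class) plays the same role as `match="Season"` here.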
Here's my code:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

result = []


def main(url):
    with requests.Session() as req:
        result = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            tr = soup.find_all("tbody")[1].find_all("tr", recursive=False)
            result.extend([
                {
                    "Season": t[1].text.strip()
                }
                for t in (t.find_all(recursive=False) for t in tr)
            ])
    df = pd.DataFrame(result)
    print(df)
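For what it's worth, the row-parsing idea in the question can be made to extract all six columns as plain text, with club names free of images, because `get_text()` only collects text nodes and skips `<img>` tags. A minimal sketch on a stand-alone snippet (the `class="transfer"` name and the HTML below are illustrative, not Transfermarkt's actual markup):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table mimicking the transfer-history layout: club cells
# contain an <img> followed by the club name as text.
html = """
<table class="transfer">
  <tbody>
    <tr><td>18/19</td><td>Jul 1, 2018</td><td><img src="m.png"/>Monaco</td>
        <td><img src="p.png"/>PSG</td><td>120.00m</td><td>145.00m</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.select_one("table.transfer tbody").find_all("tr", recursive=False)
records = [
    {
        "Season": tds[0].get_text(strip=True),
        "Date": tds[1].get_text(strip=True),
        "Left": tds[2].get_text(strip=True),    # <img> has no text, so only "Monaco" survives
        "Joined": tds[3].get_text(strip=True),
        "MV": tds[4].get_text(strip=True),
        "Fee": tds[5].get_text(strip=True),
    }
    for tds in (tr.find_all("td", recursive=False) for tr in rows)
]
df = pd.DataFrame(records)
print(df)
```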
Answer 0 (score: 3):
import requests
from bs4 import BeautifulSoup
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(url):
    with requests.Session() as req:
        links = []
        names = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            # url[:29] is "https://www.transfermarkt.com", the base for the relative hrefs
            urls = [f"{url[:29]}{item.get('href')}" for item in soup.findAll(
                "a", class_="spielprofil_tooltip")]
            ns = [item.text for item in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            links.extend(urls)
            names.extend(ns)
        return links, names


def parser():
    links, names = main(site)
    for link, name in zip(links, names):
        with requests.Session() as req:
            r = req.get(link, headers=headers)
            # the transfer-history table is the second table on the profile page
            df = pd.read_html(r.content)[1]
            df.loc[-1] = name
            df.index = df.index + 1
            df.sort_index(inplace=True)
            print(df)


parser()
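The answer only prints each player's frame; to get the single CSV the question asks for, one option is to collect the frames and write them once at the end. A hedged sketch (the `frames` list and its sample row are illustrative, not output from the scraper):

```python
import pandas as pd

# Each scraped per-player DataFrame would be appended to this list
# inside the loop, in place of print(df).
frames = [
    pd.DataFrame({"Season": ["18/19"], "Date": ["Jul 1, 2018"],
                  "Left": ["Monaco"], "Joined": ["PSG"],
                  "MV": ["120.00m"], "Fee": ["145.00m"]}),
]

# Stack all players into one frame and write it with the headers intact.
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("transfers.csv", index=False)
```

`ignore_index=True` renumbers the rows so the per-player indices don't collide, and `index=False` keeps the row numbers out of the CSV.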