Pandas returns an empty DataFrame when trying to scrape a table

Date: 2020-04-09 06:42:41

Tags: python web-scraping beautifulsoup

I am trying to get the transfer histories of the top 500 most valuable players on Transfermarkt. I managed (with some help) to loop through each player's profile and scrape their picture and name. Now I want the transfer history, which can be found in a table on each player's profile page.

I want to use Pandas to save that table in a DataFrame and then write it to a CSV with Season, Date, etc. as headers. For the clubs, e.g. Monaco and PSG, I only want the club name, not the crest image or the nationality. But right now, all I get is this:

Empty DataFrame
Columns: []
Index: []

Expected output:

Season         Date    Left Joined       MV      Fee
0  18/19  Jul 1, 2018  Monaco    PSG  120.00m  145.00m

I have looked at the page source and inspected the elements, but apart from tbody and tr I cannot find anything that helps me. My approach needs to target that table precisely, because there are several other tables on the page.
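(For reference, one way to target a single table among several is pd.read_html's match argument, which returns only the tables whose text matches a given string or regex. A minimal sketch, assuming a player-profile URL and assuming the transfer table contains the text "Season", as in the expected output above:

import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
# Hypothetical profile URL; any player page with a transfer table would do.
profile = "https://www.transfermarkt.com/kylian-mbappe/profil/spieler/342229"

r = requests.get(profile, headers=headers)
# Only tables whose text matches "Season" are returned,
# so the other tables on the page are skipped.
tables = pd.read_html(r.content, match="Season")
print(tables[0].head())

)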

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

result = []

def main(url):
    with requests.Session() as req:
        result = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')

            tr = soup.find_all("tbody")[1].find_all("tr", recursive=False)

            result.extend([
                { 
                    "Season": t[1].text.strip()

                }
                for t in (t.find_all(recursive=False) for t in tr)
            ])

df = pd.DataFrame(result)

print(df)

1 Answer:

Answer 0 (score: 3)

import requests
from bs4 import BeautifulSoup
import pandas as pd

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(url):
    with requests.Session() as req:
        links = []
        names = []

        for page in range(1, 21):
            print(f"Collecting Links From Page# {page}")
            r = req.get(url.format(page), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            # Each player link on the ranking page has the class "spielprofil_tooltip";
            # url[:29] is "https://www.transfermarkt.com", prepended to the relative hrefs.
            urls = [f"{url[:29]}{a.get('href')}" for a in soup.findAll(
                "a", class_="spielprofil_tooltip")]
            # Collect the player names the same way, dropping the last five matches.
            ns = [a.text for a in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            links.extend(urls)
            names.extend(ns)
    return links, names


def parser():
    links, names = main(site)
    for link, name in zip(links, names):
        with requests.Session() as req:
            r = req.get(link, headers=headers)
            # pd.read_html parses every <table> on the profile page;
            # the table at index 1 is the transfer history.
            df = pd.read_html(r.content)[1]
            # Prepend a row holding the player's name, then renumber and sort
            # the index so that row ends up at the top of the frame.
            df.loc[-1] = name
            df.index = df.index + 1
            df.sort_index(inplace=True)
            print(df)


parser()
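A possible next step toward the CSV described in the question (Season, Date, Left, Joined, MV, Fee, with club names as plain text) is to collect the frames and trim the columns before writing. This is a minimal sketch reusing main, site and headers from the answer above; the column labels are assumptions about how pandas parses the table header and may need adjusting, and the player name is kept in its own column instead of an extra row:

import requests
import pandas as pd


def to_csv(out_path="transfers.csv"):
    links, names = main(site)
    frames = []
    for link, name in zip(links, names):
        r = requests.get(link, headers=headers)
        df = pd.read_html(r.content)[1]
        # Keep only the columns from the expected output; the exact labels
        # depend on how the table header is parsed, so adjust as needed.
        wanted = ["Season", "Date", "Left", "Joined", "MV", "Fee"]
        df = df[[c for c in wanted if c in df.columns]]
        df.insert(0, "Player", name)  # one name column instead of an extra index row
        frames.append(df)
    pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)


to_csv()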