Question

I'm looking at the following website:

https://modules.ussquash.com/ssm/pages/leagues/League_Information.asp?leagueid=1859

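I want to extract the name of each university and the href associated with it. So for the first entry, I'd like to get Stanford and https://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=18564

I've gotten as far as collecting all the TDs with BeautifulSoup. I'm just having trouble extracting the school and its href.

Here's my attempt: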

import requests
from bs4 import BeautifulSoup

def main():
    r = requests.get('https://modules.ussquash.com/ssm/pages/leagues/League_Information.asp?leagueid=1859')
    data = r.text
    soup = BeautifulSoup(data)
    table = soup.find_all('table')[1]
    rows = table.find_all('tr')[1:]
    for row in rows:
        cols = row.find_all('td')
        print(cols)

When I try to access cols[0], I get:

IndexError: list index out of range

Any ideas on how to fix this would be great!

Thanks

Answer 1

The first two tr's are inside the thead and contain no td tags, so row.find_all('td') returns an empty list for them and cols[0] raises the IndexError. You want to skip the first two tr's:

rows = table.find_all('tr')[2:]

To get what you want, we can simplify things by using CSS selectors:

table = soup.find_all('table', limit=2)[1]

# skip first two tr's
rows = table.select("tr + tr + tr")
for row in rows:
    # anchor we want is inside the first td
    a = row.select_one("td a")  # or a = row.find("td").a
    print(a.text, a["href"])

Also, the href is a relative path, so you need to join it to the base URL.
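
A minimal sketch of that join, using urljoin from the standard library. The helper name absolute_links and the sample pair are hypothetical, and the relative form of the href is an assumption inferred from the absolute URL quoted in the question, not copied from the page:

```python
from urllib.parse import urljoin

# URL of the league page from the question; relative hrefs resolve against it.
BASE_URL = "https://modules.ussquash.com/ssm/pages/leagues/League_Information.asp?leagueid=1859"


def absolute_links(base_url, pairs):
    """Turn (name, relative_href) pairs into (name, absolute_url) pairs."""
    return [(name, urljoin(base_url, href)) for name, href in pairs]


# Assumed sample of what the loop above might collect (a.text, a["href"]).
scraped = [("Stanford", "Team_Information.asp?id=18564")]

for name, url in absolute_links(BASE_URL, scraped):
    print(name, url)
```

urljoin replaces the last path segment and the query string of the base URL with the relative href, so this only works if the base is the URL of the page the links were scraped from.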

BeautifulSoup: unable to access information inside a TD
