Extracting a specific part of a link with Beautiful Soup

Time: 2018-07-14 02:38:00

Tags: python web-scraping beautifulsoup

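Below is part of my web scraper. It scrapes team rosters from this website, puts the player information into arrays, and exports the arrays as columns in a CSV file. The scraper works fine, but I would also like to pull out each player's ID number, which is embedded in the href of the player's link.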

<a href="/player/542882/matt-andriese">Matt Andriese</a>

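As you can see from my code, I already search for ('a') to extract the player name (Matt Andriese), but I would also like to extract the player ID (542882) embedded in the link. Does anyone know how to do this? Thanks in advance!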

import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('http://m.rays.mlb.com/roster/')
soup = BeautifulSoup(page.text, 'html.parser')

soup.find(class_='nav-tabset-container').decompose()
soup.find(class_='column secondary span-5 right').decompose()

roster = soup.find(class_='layout layout-roster')
names = [n.contents[0] for n in roster.find_all('a')]
number = [n.contents[0] for n in roster.find_all('td', index='0')]
handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
height = [n.contents[0] for n in roster.find_all('td', index='4')]
weight = [n.contents[0] for n in roster.find_all('td', index='5')]
DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
team = [soup.find('meta',property='og:site_name')['content']] * len(names)

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','Number','Hand','Height','Weight','DOB','Team'])
    f.writerows(zip(names, number, handedness, height, weight, DOB, team))

2 answers:

Answer 0: (score: 1)

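If link is the object corresponding to the <a> tag, you can get the value of href as link['href']. To be safe, you may want to check that the tag actually has an href attribute, for example with link.has_attr('href') (note that 'href' in link tests the tag's children, not its attributes). Once you have the URL, split it on /.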

In your case, you can do the following:

ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
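
The same idea can be sketched against an inline HTML snippet instead of the live page. This is only an illustration: the first link comes from the question, the other two links and the helper name player_id are made up, and the /player/<id>/<slug> URL shape is assumed from that single example.

```python
from bs4 import BeautifulSoup

# Sample markup: the first link is from the question, the other two are invented.
html = '''
<table>
  <tr><td><a href="/player/542882/matt-andriese">Matt Andriese</a></td></tr>
  <tr><td><a href="/player/000001/example-player">Example Player</a></td></tr>
  <tr><td><a name="no-href">Anchor without href</a></td></tr>
</table>
'''

def player_id(link):
    """Return the ID segment of a /player/<id>/<slug> href, or None if absent."""
    # has_attr checks the tag's attributes; `'href' in link` would check its children.
    if not link.has_attr('href'):
        return None
    parts = link['href'].split('/')
    # '/player/542882/matt-andriese' -> ['', 'player', '542882', 'matt-andriese']
    if len(parts) > 2 and parts[1] == 'player':
        return parts[2]
    return None

soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')
names = [a.get_text(strip=True) for a in links]
ids = [player_id(a) for a in links]
print(list(zip(names, ids)))
```

Returning None for links without a usable href keeps ids the same length as names, so the two lists can still be passed to zip together when writing the CSV rows.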

Answer 1: (score: 1)

You can use re:

import requests, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('http://m.mlb.com/tb/roster').text, 'html.parser')
# Each entry is [tag name, class, optional extractor]; without an extractor the cell text is used.
headers = [['td', 'dg-jersey_number'],
           ['td', 'dg-player_headshot', lambda x: x.find('img')['src']],
           ['td', 'dg-name_display_first_last', lambda x: re.findall(r'\d+', x.find('a')['href'])[0]],
           ['td', 'dg-bats_throws'], ['td', 'dg-height'], ['td', 'dg-weight'], ['td', 'dg-date_of_birth']]
def get_data(row):
    # Apply the custom extractor if one is given, otherwise fall back to the cell's text.
    return [(c[0] if c else (lambda x: x.text))(row.find(a, {'class': b})) for a, b, *c in headers]

final_results = [get_data(i) for i in d.find_all('tr', {'index': re.compile(r'\d+')})]

Output:

[['46', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621237@2x.jpg', '621237', 'L/L', '6\'2"', '245lbs', '5/21/95'], ['35', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/542882@2x.jpg', '542882', 'R/R', '6\'2"', '225lbs', '8/28/89'], ['22', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/502042@2x.jpg', '502042', 'R/R', '6\'2"', '195lbs', '9/26/88'], ['63', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/650895@2x.jpg', '650895', 'R/R', '6\'3"', '240lbs', '1/18/94'], ['24', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/543135@2x.jpg', '543135', 'R/R', '6\'2"', '225lbs', '2/13/90'], ['58', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/629496@2x.jpg', '629496', 'R/R', '6\'0"', '220lbs', '11/4/93'], ['36', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/552640@2x.jpg', '552640', 'R/R', '6\'1"', '200lbs', '3/17/90'], ['56', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/592473@2x.jpg', '592473', 'L/L', '6\'3"', '205lbs', '1/14/89'], ['54', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/489265@2x.jpg', '489265', 'R/R', '5\'11"', '185lbs', '3/4/83'], ['57', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621289@2x.jpg', '621289', 'R/R', '5\'10"', '200lbs', '6/20/91'], ['4', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/605483@2x.jpg', '605483', 'L/L', '6\'4"', '200lbs', '12/4/92'], ['55', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/592773@2x.jpg', '592773', 'R/R', '6\'4"', '215lbs', '7/26/91'], ['61', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621056@2x.jpg', '621056', 'R/R', '6\'1"', '165lbs', '8/12/93'], ['48', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/642232@2x.jpg', '642232', 'R/L', '6\'5"', '205lbs', '12/31/91'], ['40', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/467092@2x.jpg', '467092', 'R/R', '6\'1"', '245lbs', '8/10/87'], ['45', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/491696@2x.jpg', '491696', 'R/R', '6\'0"', '200lbs', '4/30/88'], ['9', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/641343@2x.jpg', '641343', 'L/L', 
'6\'1"', '195lbs', '10/6/95'], ['26', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/596847@2x.jpg', '596847', 'L/R', '6\'1"', '230lbs', '5/19/91'], ['44', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/543068@2x.jpg', '543068', 'R/R', '6\'4"', '235lbs', '1/5/90'], ['5', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/622110@2x.jpg', '622110', 'R/R', '6\'2"', '170lbs', '1/15/91'], ['11', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/588751@2x.jpg', '588751', 'R/R', '6\'0"', '195lbs', '4/15/89'], ['28', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621002@2x.jpg', '621002', 'R/R', '5\'11"', '200lbs', '3/22/94'], ['18', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/621563@2x.jpg', '621563', 'L/R', '6\'1"', '190lbs', '4/26/90'], ['27', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/460576@2x.jpg', '460576', 'R/R', '6\'3"', '220lbs', '12/4/85'], ['39', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/595281@2x.jpg', '595281', 'L/R', '6\'1"', '215lbs', '4/22/90'], ['0', 'http://gdx.mlb.com/images/gameday/mugshots/mlb/605480@2x.jpg', '605480', 'L/R', '5\'10"', '180lbs', '5/6/93']]

Note that the output contains the player ID as the third element of each sublist.
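
If only the ID is needed, the regular expression can be anchored to the /player/ segment rather than taking the first run of digits. A minimal sketch, assuming hrefs shaped like the one in the question; the pattern name PLAYER_ID, the helper extract_id, and the second and third sample hrefs are hypothetical.

```python
import re

# Assumes hrefs of the form /player/<digits>/<slug>, as in the question's example.
PLAYER_ID = re.compile(r'/player/(\d+)(?:/|$)')

def extract_id(href):
    """Return the numeric player ID from an href, or None if the pattern does not match."""
    match = PLAYER_ID.search(href)
    return match.group(1) if match else None

# First href is from the question; the other two are invented to show the edge cases.
hrefs = ['/player/542882/matt-andriese', '/team/139/roster', '/player/621237']
ids = [extract_id(h) for h in hrefs]
print(ids)
```

Anchoring on /player/ avoids picking up unrelated digits elsewhere in a URL, which a bare \d+ search would return.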