Web scraping with Python BeautifulSoup: iterating over HTML tags

Date: 2019-01-30 17:52:51

Tags: python, beautifulsoup

I'm web scraping with Python BeautifulSoup from the URL 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'. I want to scrape each player's name, their injury, and the week they were injured. I can scrape the week 1 information, which prints the following results:

[['Danny Amendola'], 'Questionable: hamstring', 'week_1']
[['Armond Armstead'], 'Out: infection', 'week_1']
[['Kyle Arrington'], 'NA', 'week_1']
[['Brandon Bolden'], 'Questionable: knee', 'week_1']
... and so on for all the week 1 injuries.

But it stops once all of the week 1 injuries have been printed.

I want the results to continue through week 2, week 3, week 4, and so on.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

containers = page_soup.find("tbody")
head = page_soup.find("thead")


player = containers.find_all("tr")

for tr in player:
    th = tr.find_all("th")
    name = [i.text for i in th]

    week = tr.td["data-stat"]

    try:
        injury = tr.td["data-tip"]
        print([name, injury, week])
    except KeyError:
        injury = "NA"
        print([name, injury, week])

The result I'm looking for is for the code to print the players' names, their injuries, and the week of the injury for every week shown in the table at the URL. For example, once all of the week 1 injuries have printed, I'd like it to show all of the week 2 injuries, then week 3, and so on. So it would look like this:

[['Adrian Wilson'], 'Injured Reserve: hamstring', 'week_1']
[['Tavon Wilson'], 'NA', 'week_1']
[['Markus Zusevics'], 'Injured Reserve: undisclosed', 'week_1']
[['Danny Amendola'], 'Questionable: groin', 'week_2']
...

3 Answers:

Answer 0 (score: 1)

You are only looking at the first `data-tip` instance in each row; iterating over every `<td>` should work:

player = containers.find_all("tr")
for tr in player:
    th = tr.find_all("th")
    name = [i.text for i in th]
    # loop over every <td> in the row, not just the first one
    for td in tr.find_all("td"):
        week = td["data-stat"]
        try:
            injury = td["data-tip"]
        except KeyError:
            injury = "NA"
        print([name, injury, week])
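The one-line change matters because `tr.td` is shorthand for `tr.find("td")`, which returns only the first cell in the row. A minimal sketch on a made-up two-cell row (the HTML snippet below is hypothetical, not taken from the actual page) shows the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical row with two <td> cells, standing in for one table row
html = '<tr><td data-stat="week_1">ok</td><td data-stat="week_2">out</td></tr>'
row = BeautifulSoup(html, "html.parser").tr

# row.td fetches only the FIRST <td> -- this is why only week 1 appeared
first = row.td["data-stat"]

# find_all("td") yields every cell, so every week gets visited
all_weeks = [td["data-stat"] for td in row.find_all("td")]
print(first, all_weeks)
```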

Answer 2 (score: 0)

Code:

import re
import requests
from bs4 import BeautifulSoup as soup

html = requests.get('https://www.pro-football-reference.com/teams/nwe/2013_injuries.htm').text
page_soup = soup(html, 'html.parser')
containers = page_soup.find('tbody')
players = containers.find_all('tr')
for player in players:
    th = player.find_all('th')
    name = [i.text for i in th]
    # every injury cell on this page has a class starting with "center poptip"
    tds = player.find_all('td', {'class': re.compile('^center poptip')})
    # pair each injury (data-tip) with its week (data-stat), then flatten
    pairs = zip([td['data-tip'] for td in tds], [td['data-stat'] for td in tds])
    weeklyInjuries = ', '.join(', '.join(pair) for pair in pairs)
    if len(weeklyInjuries) == 0:
        weeklyInjuries = 'N/A'
    print([name, weeklyInjuries])
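The join expression above can be unpacked into named steps. This standalone sketch shows the same zip-and-join pattern on hypothetical attribute lists (the values are illustrative, not scraped from the page):

```python
# Hypothetical data-tip and data-stat values for one player's cells
tips = ['Questionable: hamstring', 'Out: knee']
stats = ['week_1', 'week_2']

# zip pairs each injury with its week; the joins flatten the pairs
# into a single comma-separated summary string
pairs = [', '.join(pair) for pair in zip(tips, stats)]
weekly = ', '.join(pairs)
print(weekly)  # Questionable: hamstring, week_1, Out: knee, week_2
```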