Accessing commented-out HTML lines with BeautifulSoup

Date: 2017-07-15 23:00:02

Tags: python-3.x beautifulsoup

I am trying to scrape statistics from this particular web page: https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/

However, when I look at the HTML source, the table for the "Defensive Game Log" appears to be commented out (it starts with <!-- and ends with -->).

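As a result, when I use BeautifulSoup4, the following code captures only the offensive data, which is not commented out; the defensive data, which is commented out, is skipped.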

from urllib.request import Request,urlopen
from bs4 import BeautifulSoup
import re

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link.read(), "lxml")


tables = soup.find_all(['th', 'tr'])
my_table = tables[0]
rows = my_table.findChildren(['tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)

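I'm curious whether there is any solution, inside or outside BeautifulSoup4, for adding all the defensive values to a list the same way the offensive data is stored. Thanks!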

Note: I extended the solution below with the following code, derived from here:

data = []

table = defensive_log
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
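One side effect of the row loop above is worth seeing on a small input: dropping empty cells shortens some rows, so values no longer line up by column. The sketch below uses a hypothetical two-row table (the cell values are made up for illustration, not taken from the site) and the built-in "html.parser" so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; the first row has an empty middle cell.
SAMPLE_TABLE = """
<table><tbody>
<tr><td>2016-09-01</td><td></td><td>14</td></tr>
<tr><td>2016-09-09</td><td>@</td><td>28</td></tr>
</tbody></table>
"""

table = BeautifulSoup(SAMPLE_TABLE, "html.parser").find("table")

dropped = []  # rows with empty cells removed, as in the loop above
kept = []     # rows with empty cells preserved as ''
for row in table.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    dropped.append([c for c in cells if c])
    kept.append(cells)

print(dropped)  # rows have different lengths
print(kept)     # every row has three columns
```

If the values will later be read by column position, keeping the empty strings is probably the safer choice.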

1 Answer:

Answer 0: (score: 3)

The Comment object will give you what you need:

from urllib.request import Request,urlopen
from bs4 import BeautifulSoup, Comment

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link, "lxml")

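# Collect every HTML comment node in the document.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
defensive_log = None
for comment in comments:
    # Re-parse the comment's text so its contents become ordinary tags.
    comment_soup = BeautifulSoup(str(comment), 'lxml')
    defensive_log = comment_soup.find('table')  # search as an ordinary tag
    if defensive_log:
        break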
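The same idea can be tried offline before hitting the real site. The following is only a sketch under stated assumptions: SAMPLE_HTML is a hypothetical miniature page with made-up values, the helper names tables_in_comments and table_rows are introduced here for illustration, and the built-in "html.parser" is used in place of lxml so no extra parser is needed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: one visible table and one hidden in a comment.
SAMPLE_HTML = """
<html><body>
<table id="offense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>70</td></tr>
</tbody></table>
<!--
<table id="defense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>14</td></tr>
<tr><th>2</th><td>2016-09-09</td><td>28</td></tr>
</tbody></table>
-->
</body></html>
"""


def tables_in_comments(html, parser="html.parser"):
    """Return every <table> found inside HTML comments (illustrative helper)."""
    soup = BeautifulSoup(html, parser)
    found = []
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Re-parse the comment text so its markup becomes ordinary tags.
        inner = BeautifulSoup(str(comment), parser)
        found.extend(inner.find_all("table"))
    return found


def table_rows(table):
    """Return the stripped text of the <td> cells of each row (illustrative helper)."""
    return [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in table.find_all("tr")]


hidden = tables_in_comments(SAMPLE_HTML)
defense_rows = table_rows(hidden[0])
print(defense_rows)
```

Collecting all commented-out tables, rather than stopping at the first one, makes it possible to pick the right table by its id attribute if a page hides more than one.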