BeautifulSoup's "find" behaves inconsistently (bs4)

Time: 2015-06-25 18:05:17

Tags: python python-2.7 beautifulsoup

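I am scraping player statistics from the NFL website. I run into a problem when parsing the page and trying to access the HTML table that contains the actual information I am looking for. I successfully downloaded the page and saved it to my working directory. For reference, the page I saved can be found here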

# import relevant libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("1998.html"))
result = soup.find(id="result")
print result

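At one point I ran this code and result printed the correct table I was looking for. Every other time, it contains nothing! I assume this is user error, but I can't figure out what I'm missing. Using "lxml" returned nothing, and I can't get html5lib to work (is it a parsing library??).

Thanks for any help!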

1 answer:

Answer 0: (score: 1)

First, you should read the contents of your file before passing it to BeautifulSoup.

soup = BeautifulSoup(open("1998.html").read())
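
A slightly more defensive variant of the line above, as a minimal sketch rather than a definitive fix: it closes the file explicitly and names the parser, because bs4 otherwise picks whichever parser happens to be installed, and that choice can differ between environments. The helper name load_soup, the UTF-8 encoding, and the temporary demo file are assumptions made for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import io
import os
import tempfile

from bs4 import BeautifulSoup


def load_soup(path, parser="html.parser"):
    """Read an HTML file fully, close it, then parse it with a named parser.

    load_soup is a hypothetical helper; UTF-8 is an assumed encoding.
    """
    with io.open(path, encoding="utf-8") as handle:
        html = handle.read()
    return BeautifulSoup(html, parser)


# Demo on a tiny stand-in file (not the real saved page).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.html")
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u'<table id="result"><tr><td>x</td></tr></table>')

soup = load_soup(demo_path)
found = soup.find(id="result") is not None
print(found)
```

For the saved page in the question, the call would be load_soup("1998.html"), assuming the file is UTF-8 encoded.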

Second, verify manually that the table in question exists in the HTML by printing the contents to screen. The .prettify() method makes the data easier to read.

print soup.prettify()

Lastly, if the element does in fact exist, the following will be able to find it:

table = soup.find('table',{'id':'result'})
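
Note that soup.find returns None when nothing matches, so it helps to check the result before using it. The sketch below illustrates this on a small inline document; SAMPLE_HTML and find_result_table are hypothetical stand-ins, not the real NFL page or part of bs4, and the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

from bs4 import BeautifulSoup

# Hypothetical placeholder markup; the real saved page is much larger.
SAMPLE_HTML = """
<html><body>
  <table id="result">
    <tr><th>Player</th><th>Yds</th></tr>
    <tr><td>Player A</td><td>100</td></tr>
  </table>
</body></html>
"""


def find_result_table(html, parser="html.parser"):
    """Return the <table id="result"> element, or None if it is absent."""
    soup = BeautifulSoup(html, parser)
    return soup.find("table", {"id": "result"})


table = find_result_table(SAMPLE_HTML)
missing = find_result_table("<html><body><p>no table</p></body></html>")

# Guard against None before calling methods on the result.
row_count = len(table.find_all("tr")) if table is not None else 0
print(row_count)
print(missing)
```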

A simple test script I wrote cannot reproduce your results.

import urllib
from bs4 import BeautifulSoup

def test():
    # The URL of the page you're scraping.
    url = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=1998&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'

    # Make a request to the URL.
    conn = urllib.urlopen(url)

    # Read the contents of the response
    html = conn.read()

    # Close the connection.
    conn.close()

    # Create a BeautifulSoup object and find the table.
    soup = BeautifulSoup(html)
    table = soup.find('table',{'id':'result'})

    # Find all rows in the table.
    trs = table.findAll('tr')

    # Print to screen the number of rows found in the table.
    print len(trs)

test()

This outputs 51 every time.
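
Since the question mentions trying "lxml" and html5lib, one way to check whether the parser choice affects the result is to run the same lookup with each parser that is installed. This is only a sketch under assumptions: table_found_by_parser is a hypothetical helper, the sample markup is a stand-in for the real page, and parsers that are not installed are reported as None instead of raising. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

from bs4 import BeautifulSoup, FeatureNotFound


def table_found_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to True/False (table found or not),
    or to None when that parser is not installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            results[name] = None
            continue
        results[name] = soup.find("table", {"id": "result"}) is not None
    return results


# Stand-in markup; pass the contents of the saved page here instead.
sample = ('<html><body><table id="result">'
          '<tr><td>x</td></tr></table></body></html>')
results = table_found_by_parser(sample)
for name in sorted(results):
    print(name, results[name])
```

If the three parsers disagree on the real page, that would suggest the markup is malformed in a way each parser repairs differently.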