Question

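I am trying to get the code for all the tables on http://www.pro-football-reference.com/years/1932, but I only get the first table.

I tried switching the parser to lxml, but it still reports a length of 1 for the total number of tables.

It also does not return all the tags, such as divs, tds, and so on.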

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.pro-football-reference.com/years/1932'

url = base_url

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

print(len(soup.find_all('table')))

Answer 1

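This happens because the page renders its tables with JavaScript. There are two ways to handle it:

The first is to use a scraping engine that runs JavaScript, such as Selenium.
The second is to look through the HTML content and render it yourself.

For this code I took the second approach. I found that the tables are hidden between <!-- and -->. Find all of those markers and remove them.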

import re
from bs4 import BeautifulSoup
import requests

base_url = 'http://www.pro-football-reference.com/years/1932/'

url = base_url

r = requests.get(url)
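# Remove the lines that open and close the comments so the hidden
# tables become ordinary markup. r.text is a str, which the str
# patterns below require on Python 3 (r.content is bytes).
content = re.sub(r'(?m)^<!--.*\n?', '', r.text)
content = re.sub(r'(?m)^-->.*\n?', '', content)
soup = BeautifulSoup(content, 'html.parser')
print(len(soup.find_all('table')))
for table in soup.find_all('table'):
    # Print only top-level tables, skipping any nested inside another table.
    if table.find_parent("table") is None:
        print("\n\n" + str(table))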
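
As an alternative to editing the raw HTML with regular expressions, you could ask BeautifulSoup for the comment nodes and re-parse the ones that contain a table. This is only a minimal sketch, not something from the original answer: SAMPLE_HTML is a made-up stand-in for the real page, find_all_tables is a hypothetical helper name, and I am assuming the hidden tables sit inside ordinary HTML comments as described above.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page (an assumption, not the real site):
# one visible table and one table wrapped in an HTML comment.
SAMPLE_HTML = """
<html><body>
<table id="visible"><tr><td>1</td></tr></table>
<div class="placeholder">
<!--
<table id="hidden"><tr><td>2</td></tr></table>
-->
</div>
</body></html>
"""


def find_all_tables(html):
    """Return tables in the live markup plus tables hidden inside comments."""
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table')
    # The parser treats a comment as a single string node, so any markup
    # inside it has to be parsed again before its tags can be searched.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if '<table' in comment:
            hidden = BeautifulSoup(comment, 'html.parser')
            tables.extend(hidden.find_all('table'))
    return tables


tables = find_all_tables(SAMPLE_HTML)
print(len(tables))
print([t.get('id') for t in tables])
```

On the sample, a plain find_all('table') sees only the visible table, while the helper returns both. To try it on the real page, you would pass r.text from the requests call instead of SAMPLE_HTML; whether that picks up every table depends on the page still using comments this way.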
