Question

I'm having trouble scraping a table with BeautifulSoup. Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")

stats = soup.find('table',  id = 'totals')

In [78]: print(stats)
None

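When I right-click the table and inspect the element, the HTML looks the way I expect. But when I view the page source, the only element with id='totals' is commented out. Is there a way to scrape a table out of commented source code?

I have referred to this post but can't seem to reproduce their solution.

Here is the link to the webpage I'm interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.

I'm relatively new to Python, HTML, and web scraping. Any help would be appreciated.

Thanks in advance.

Michael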

Answer 1

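Comments are string instances in BeautifulSoup. You can use BeautifulSoup's find method with a regular expression to locate the particular string you're after. Once you have that string, have BeautifulSoup parse it as HTML, and there you go.

In other words,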

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")

stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")

print(stats_soup.table.caption.text)
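
As a variation on the same idea, here is a minimal offline sketch that matches comments by type (bs4's Comment class) rather than by regular expression, then pulls the rows out of the re-parsed table. SAMPLE_HTML, its player rows, and the helper names find_commented_table and table_to_rows are made up for illustration; they are not taken from the real page, whose markup may differ.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table lives inside an HTML comment.
SAMPLE_HTML = """
<div id="all_totals">
<!--
<table id="totals">
  <caption>Totals</caption>
  <thead><tr><th>Player</th><th>G</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>36</td><td>612</td></tr>
    <tr><td>Player B</td><td>36</td><td>389</td></tr>
  </tbody>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the first <table id=table_id> found inside any HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a NavigableString subclass, so it can be matched via string=...
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # Re-parse the comment body as HTML in its own right.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_to_rows(table):
    """Flatten a table into a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


table = find_commented_table(SAMPLE_HTML, "totals")
rows = table_to_rows(table)
header, body = rows[0], rows[1:]

print(table.caption.get_text(strip=True))
print(header)
print(body)
```

If pandas is installed, pandas.DataFrame(body, columns=header) should turn these rows into the data frame the question asks for. All cells come back as strings, so numeric columns would still need converting.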

Answer 2

You can do it like this:

from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page, "lxml")

# Grab the wrapper div(s) around the totals table
stats = soup.find_all('div', id='all_totals')
print(stats)

Let me know if this helped!
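
One caveat, sketched under assumptions: if the table is commented out as the question describes, finding the wrapper div is only the first step, because the table inside it is still comment text rather than a parsed tag. The markup below is a made-up miniature of that situation, not the real page. It shows the direct lookup failing and the comment inside the wrapper being re-parsed.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: a wrapper div whose table is commented out.
SAMPLE_HTML = (
    '<div id="all_totals">'
    '<!-- <table id="totals"><tr><td>36</td></tr></table> -->'
    '</div>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
wrapper = soup.find("div", id="all_totals")

# The wrapper is found, but the table inside it is only comment text,
# so a normal tag search within the wrapper comes back empty.
direct_hit = wrapper.find("table", id="totals")

# Look only at the wrapper's own comments, then parse the first one as HTML.
comment = wrapper.find(string=lambda text: isinstance(text, Comment))
table = BeautifulSoup(comment, "html.parser").find("table", id="totals")

print(direct_hit)
print(table.td.get_text())
```

Scoping the comment search to the wrapper avoids re-parsing every comment on the page, which may matter on pages that hide several tables this way.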

Python BeautifulSoup can't find table ID
