Unable to find a table with soup.findAll('table') using BeautifulSoup in Python

Date: 2013-10-31 07:51:06

Tags: python-2.7 beautifulsoup tags find findall

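I am using soup.findAll('table') to try to find a table in an HTML file, but it does not show up. The table definitely exists in the file, and a regular expression is able to find it, as shown here: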

import sys
import urllib2
from bs4 import BeautifulSoup
import re
webpage = open(r'd:\samplefile.html', 'r').read()
soup = BeautifulSoup(webpage)
print re.findall("TABLE",webpage)   #works, prints ['TABLE','TABLE']
print soup.findAll("TABLE")   # prints an empty list []

I know I am building the soup correctly, because when I do this:

print [tag.name for tag in soup.findAll(align=None)]

it correctly prints the tags it finds. I have tried different ways of writing "TABLE", such as "table", "Table", etc. Also, if I open the file in a text editor, it does contain "TABLE".

Why can't BeautifulSoup find the table?

1 Answer:

Answer 0: (score: 1)

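Context

Python 2.x
BeautifulSoup HTML parser

Problem

BeautifulSoup's findAll does not return all the expected tags, or returns none at all, even though the user knows the tags exist in the markup.

Solution

Specify a different parser when initializing the BeautifulSoup constructor: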

## BEFORE
soup = BeautifulSoup(webpage)

## AFTER
soup = BeautifulSoup(webpage, "html5lib")

Rationale

The target markup may contain malformed HTML, and different parsers have varying degrees of success with it.
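
To see how the parsers differ on a given document, one option is to run the same search under each of them and compare the counts. The following is only a sketch under stated assumptions: SAMPLE_HTML is made-up markup standing in for the asker's file, count_tables is a hypothetical helper name, and lxml and html5lib are optional third-party packages that may not be installed (they are skipped if missing). It is written for Python 3 and bs4. Note that the HTML parsers lowercase tag names, so the search uses "table" rather than "TABLE".

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup (an assumption, not the asker's actual file):
# upper-case tag names plus an unclosed <p>.
SAMPLE_HTML = (
    "<html><body><TABLE><TR><TD>cell</TD></TR></TABLE>"
    "<p>unclosed paragraph</body></html>"
)


def count_tables(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of <table> tags it finds.

    A parser that is not installed maps to None instead of raising.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Optional parser library is missing; record that and move on.
            results[name] = None
            continue
        # HTML parsers normalize tag names to lower case.
        results[name] = len(soup.find_all("table"))
    return results


if __name__ == "__main__":
    for parser_name, count in count_tables(SAMPLE_HTML).items():
        print(parser_name, count)
```

If one parser reports fewer tables than another on the real file, that points to malformed markup that the stricter parser is dropping or restructuring.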

See also

related post by Martijn that addresses the same issue