Question

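I am trying to extract the TR data from the following page: http://www.datasheetcatalog.com/catalog/p1342320.shtml

I am using requests and BeautifulSoup. However, I do not get all of the rows (only 12 rows from the second table instead of 22). Does anyone have an explanation for this, given that the rows are there when I print response.content?

This is the code I am using: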

from bs4 import BeautifulSoup
import requests

session = requests.Session()

url = 'http://www.datasheetcatalog.com/catalog/p1342320.shtml'
response = session.get(url)

soup = BeautifulSoup(response.content, "lxml")

trs = soup.findAll('table')[8].findAll('tr')
print(len(trs))

Answer 1

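After inspecting the HTML of the page in detail, I found that BeautifulSoup stops parsing after it hits a comment. The solution is therefore to change the parser from "lxml" to "html5lib" (html5lib is a separate package and has to be installed first):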

soup = BeautifulSoup(response.content, "html5lib")
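A quick way to confirm that a parser change really recovers every row is to compare the parsed count with a count of raw `<tr` tags in the response text, since the question states the rows are present in response.content. The following is only a sketch: `count_raw_tags` is a hypothetical helper name, and it counts opening tags across the whole document (including any inside comments), so it gives an upper bound rather than a per-table figure.

```python
import re


def count_raw_tags(html: str, tag: str) -> int:
    """Count opening tags named `tag` in raw HTML text, without parsing it.

    Matching is case-insensitive and ignores closing tags. The word
    boundary prevents 'tr' from also matching longer names like 'track'.
    """
    pattern = r'<%s\b' % re.escape(tag)
    return len(re.findall(pattern, html, flags=re.IGNORECASE))


# Small self-contained sample instead of the live page.
sample = "<table><tr><td>a</td></tr><TR class='x'><td>b</td></tr><track></table>"
raw_rows = count_raw_tags(sample, "tr")
print(raw_rows)
```

If the number printed for the real page is clearly larger than `len(trs)`, the parser is dropping markup rather than the server omitting it.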

Answer 2

The HTML is broken, so it is not valid input for BeautifulSoup. Cleaning it up before parsing helps:

import re

# Repair the malformed table tag, then drop the malformed comments and font tags
html_doc = response.text.replace('<table <', '<')
html_doc = re.sub(r'<\!--\s+\d+\s+--\!>', '', html_doc)
html_doc = re.sub(r'</?font.*?>', '', html_doc)
soup = BeautifulSoup(html_doc, "html.parser")
trs = soup.findAll('table')[8].findAll('tr')
print(len(trs))

Note: using .... returns 7 instead of 22.
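The cleanup steps in this answer can be wrapped in a small function and tried on a short string before running them against the live page. This is a minimal sketch: `clean_html` is a hypothetical name, the sample markup is invented for illustration, and the three patterns are copied from the answer, so they are specific to the defects on that page.

```python
import re


def clean_html(raw_html: str) -> str:
    """Apply the three page-specific repairs before handing HTML to a parser."""
    # A stray '<table <' sequence is collapsed to a single '<'.
    cleaned = raw_html.replace('<table <', '<')
    # Malformed numeric comments such as '<!-- 12 --!>' are removed.
    cleaned = re.sub(r'<\!--\s+\d+\s+--\!>', '', cleaned)
    # Opening and closing font tags are stripped, keeping their contents.
    cleaned = re.sub(r'</?font.*?>', '', cleaned)
    return cleaned


# Invented sample that contains all three defects.
sample = '<table <tr><td><font color="red">A</font></td><!-- 12 --!></tr>'
result = clean_html(sample)
print(result)
```

The cleaned string can then be passed to BeautifulSoup in place of response.text.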

Python requests not extracting all elements
