Question

I am using BeautifulSoup to find all the p tags in an HTML page that I saved locally. My code is:

with open ("./" + str(filename) + ".txt", "r") as myfile:
    data=myfile.read().replace('\n', '')
soup = BeautifulSoup(data)
t11 = soup.findAll("p", {"class": "commentsParagraph"})

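This code works for part of the page, but some parts of the page are loaded via ajax (I let them load before saving the source), and for those parts the code does not work.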

To test this, I changed the class of one of the p tags in the ajax part to commentsParagraph2 and changed my code to:

t11 = soup.findAll("p", {"class": "commentsParagraph2"})

But t11 is an empty list.

I have attached the page file here.

Any ideas?

Answer 1

Your HTML does contain a p tag with the class commentsParagraph2, and bs4 finds it without any problem using all three parsers:

In [8]: from bs4 import BeautifulSoup
   ...: soup1 = BeautifulSoup(open("/home/padraic/t.html").read(), "html5lib")
   ...: soup2 = BeautifulSoup(open("/home/padraic/t.html"), "html.parser")
   ...: soup3 = BeautifulSoup(open("/home/padraic/t.html"), "lxml")
   ...: print(soup1.select_one("p.commentsParagraph2"))
   ...: print(soup2.select_one("p.commentsParagraph2"))
   ...: print(soup3.select_one("p.commentsParagraph2"))
   ...: 
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>

So you are using either the broken, no-longer-maintained BeautifulSoup 3 or an old version of bs4.
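
One way to check which case applies is to print the installed version and repeat the lookup with an explicitly named parser. Note that BeautifulSoup 3 is imported from the BeautifulSoup package rather than from bs4. The snippet below is only a minimal sketch: it assumes bs4 is installed, and the inline sample_html and the helper name find_comments are made up for illustration and stand in for the saved page.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page; the real file is much larger.
sample_html = """
<div>
  <p class="commentsParagraph">First comment.</p>
  <p class="commentsParagraph2">Second comment.</p>
</div>
"""


def find_comments(html, css_class):
    """Return the text of every <p> with the given class (illustrative helper)."""
    # Naming the parser explicitly avoids depending on whichever one
    # happens to be installed on the machine.
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p", class_=css_class)]


print(bs4.__version__)  # the installed bs4 version
print(find_comments(sample_html, "commentsParagraph2"))
```

If the version printed is old, upgrading bs4 and rerunning the same lookup on the saved file is a reasonable next step.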

Answer 2

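I downloaded your HTML and ran some tests; the beautifulsoup module only finds three p nodes. My guess is that this is because the HTML contains some iframes, so BS may not work on it. My suggestion is to use the re module instead of bs.

Sample code for your reference: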

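import re

# Read the saved page as a single string.
with open('1.html', 'r') as f:
    data = f.read()

# Match the run of allowed characters that follows each
# <p class="commentsParagraph"> and precedes a </p>.
m = re.findall(r'(?<=<p class="commentsParagraph">)[\!\w\s.\'\,\-\(\)\@\#\$\%\^\&\*\+\=\/|\^<]+(?=</p>)', data)
print(m)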
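
To see what the regex approach returns without the saved page, here is a small sketch that runs on an inline string. The sample_html and the function name extract_paragraphs are invented for illustration, and the pattern is a simplified non-greedy variant rather than the exact expression above. Like any regex over HTML, it assumes the class attribute is written exactly as shown, with no extra attributes or different quoting.

```python
import re

# Hypothetical sample markup standing in for the saved page.
sample_html = (
    '<p class="commentsParagraph">Great lecturer.</p>'
    '<p class="other">Ignored.</p>'
    '<p class="commentsParagraph">Tough exams,\nfair grading.</p>'
)


def extract_paragraphs(html, css_class):
    """Return the inner text of each <p class="css_class">...</p> (illustrative helper)."""
    # Non-greedy (.*?) stops at the first closing </p>;
    # re.DOTALL lets the text span several lines.
    pattern = re.compile(
        r'<p class="' + re.escape(css_class) + r'">(.*?)</p>',
        re.DOTALL,
    )
    return [text.strip() for text in pattern.findall(html)]


print(extract_paragraphs(sample_html, "commentsParagraph"))
```

The non-greedy group keeps each match inside a single paragraph, which a greedy character class does not guarantee when several p tags sit next to each other.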

BeautifulSoup does not find all p tags of a certain class
