BeautifulSoup not finding all p tags of a certain class

Time: 2016-10-10 00:35:05

Tags: python beautifulsoup

I am using BeautifulSoup to find all p tags in an HTML page that I saved locally. My code is

with open("./" + str(filename) + ".txt", "r") as myfile:
    data = myfile.read().replace('\n', '')
soup = BeautifulSoup(data)
t11 = soup.findAll("p", {"class": "commentsParagraph"})

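This code works for part of the page, but some parts of the page are loaded with ajax (I loaded them in the browser before saving the source), and the code does not find the p tags there.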

To test this, I gave one of the p tags in the ajax-loaded part the class commentsParagraph2 and changed my code to

t11 = soup.findAll("p", {"class": "commentsParagraph2"})

But t11 is an empty list.

I am attaching the page file here

Any ideas?

2 Answers:

Answer 0: (score: 1)

Your html contains a p tag with the class commentsParagraph2, and bs4 finds it without any problem using all three parsers:

In [8]: from bs4 import BeautifulSoup
   ...: soup1 = BeautifulSoup(open("/home/padraic/t.html").read(), "html5lib")
   ...: soup2 = BeautifulSoup(open("/home/padraic/t.html"), "html.parser")
   ...: soup3 = BeautifulSoup(open("/home/padraic/t.html"), "lxml")
   ...: print(soup1.select_one("p.commentsParagraph2"))
   ...: print(soup2.select_one("p.commentsParagraph2"))
   ...: print(soup3.select_one("p.commentsParagraph2"))
   ...:
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>

So you are using either the broken, no longer maintained BeautifulSoup3 or an old version of bs4.
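One way to check this locally is to print the installed bs4 version and run the same search with an explicitly named parser. The following is only a minimal sketch: the inline `html` string is a made-up stand-in for the saved page, not the asker's actual file.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for the saved page.
html = """
<div>
  <p class="commentsParagraph">first comment</p>
  <p class="commentsParagraph2">second comment</p>
</div>
"""

# If this import works and prints a 4.x version, BeautifulSoup 4 is in use.
print(bs4.__version__)

# Name the parser explicitly so results do not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("p", class_="commentsParagraph2")
texts = [p.get_text(strip=True) for p in matches]
print(texts)
```

If the import of `bs4` fails while `from BeautifulSoup import BeautifulSoup` succeeds, the old BeautifulSoup3 package is the one being used.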

Answer 1: (score: -1)

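I downloaded your html and ran some tests; the beautifulsoup module can only find three p nodes. I guess that is because there are some iframes in the html, so BS may not work. My suggestion is to use the re module instead of bs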

Sample code for your reference:

import re

with open('1.html', 'r') as f:
    data = f.read()
    m=re.findall(r'(?<=<p class="commentsParagraph">)[\!\w\s.\'\,\-\(\)\@\#\$\%\^\&\*\+\=\/|\^<]+(?=</p>)', data)
    print(m)
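For comparison, the same idea can be sketched with a shorter, non-greedy pattern instead of an explicit character class. This is an illustration on a made-up inline string, not the asker's file, and it assumes the class attribute is written exactly as `class="commentsParagraph"` with no other attributes on the tag; a regex like this is fragile if the markup varies.

```python
import re

# Hypothetical sample markup; the real page is not reproduced here.
html = (
    '<p class="commentsParagraph">Great class.</p>'
    '<p class="commentsParagraph2">Tough exams.</p>'
    '<p class="commentsParagraph">Would take again.</p>'
)

# Non-greedy capture up to the first closing </p>; DOTALL lets the
# paragraph text span several lines.
pattern = re.compile(r'<p class="commentsParagraph">(.*?)</p>', re.DOTALL)
comments = [text.strip() for text in pattern.findall(html)]
print(comments)
```

Because the pattern requires the closing quote right after `commentsParagraph`, tags with the class `commentsParagraph2` are not matched.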