I am using BeautifulSoup to find all the p tags in an HTML page that I saved locally. My code is:
with open("./" + str(filename) + ".txt", "r") as myfile:
    data = myfile.read().replace('\n', '')
soup = BeautifulSoup(data)
t11 = soup.findAll("p", {"class": "commentsParagraph"})
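This code works for part of the page, but some parts of the page are loaded via AJAX (I let them load before saving the source), and the code does not find the p tags in those parts.
To test this, I added the class commentsParagraph2 to one of the p tags in the AJAX part and changed my code to: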
t11 = soup.findAll("p", {"class": "commentsParagraph2"})
But t11 is an empty list.
I am attaching the page file here.
Any ideas?
Answer 0: (score: 1)
Your HTML contains a p tag with the class commentsParagraph2, and bs4 finds it without any problem using all three parsers:
In [8]: from bs4 import BeautifulSoup
   ...: soup1 = BeautifulSoup(open("/home/padraic/t.html").read(), "html5lib")
   ...: soup2 = BeautifulSoup(open("/home/padraic/t.html"), "html.parser")
   ...: soup3 = BeautifulSoup(open("/home/padraic/t.html"), "lxml")
   ...: print(soup1.select_one("p.commentsParagraph2"))
   ...: print(soup2.select_one("p.commentsParagraph2"))
   ...: print(soup3.select_one("p.commentsParagraph2"))
   ...:
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>
<p class="commentsParagraph2">
So much better than Ryder. Only take Econ 11 if she's one of the professors teaching it. Beware her tests though, which are much different from Ryder's.
</p>
So you are using either the broken, no longer maintained BeautifulSoup3 or an old version of bs4.
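One way to confirm which library is actually being imported, and to parse with an explicitly named parser, is sketched below. The inline HTML string is a made-up stand-in for the saved page, not the asker's file, and the helper name find_paragraphs is hypothetical. "html.parser" is used only because it needs no extra install.

```python
import bs4  # BeautifulSoup 4 package; BeautifulSoup 3 was imported as `BeautifulSoup` instead
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page (not the asker's actual file).
SAMPLE_HTML = """
<div>
  <p class="commentsParagraph">first comment</p>
  <p class="commentsParagraph2">second comment</p>
</div>
"""


def find_paragraphs(html, css_class):
    """Return the stripped text of every <p> that carries the given class."""
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    return [p.get_text(strip=True) for p in soup.find_all("p", class_=css_class)]


# Shows which bs4 release is installed.
print("bs4 version:", bs4.__version__)

matches = find_paragraphs(SAMPLE_HTML, "commentsParagraph2")
print(matches)
```

If the import of bs4 fails, or the printed version is old, upgrading the package is probably the first thing to try before changing the search code.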
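Answer 1: (score: -1)
I downloaded your HTML and ran some tests; the beautifulsoup module could only find three p nodes. My guess is that this is because the HTML contains some iframes, so BS may not work on it. My suggestion is to use the re module instead of bs.
Sample code for your reference: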
import re
with open('1.html', 'r') as f:
    data = f.read()
m=re.findall(r'(?<=<p class="commentsParagraph">)[\!\w\s.\'\,\-\(\)\@\#\$\%\^\&\*\+\=\/|\^<]+(?=</p>)', data)
print(m)
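If a regular expression is used anyway, a non-greedy capture group may be easier to read than the long character class above. This is only a sketch on a made-up sample string, not the asker's page. It assumes the class attribute is written exactly as shown and that no p tag is nested inside another, and the name extract_comments is hypothetical.

```python
import re

# Hypothetical sample input standing in for the saved page.
SAMPLE_HTML = (
    '<p class="commentsParagraph">first\ncomment</p>'
    '<p class="commentsParagraph">second</p>'
)

# Non-greedy group: capture everything up to the nearest closing </p>.
# re.DOTALL lets "." match newlines inside a paragraph.
PATTERN = re.compile(r'<p class="commentsParagraph">(.*?)</p>', re.DOTALL)


def extract_comments(html):
    """Return the text between each matching <p ...> and its closing </p>."""
    return [text.strip() for text in PATTERN.findall(html)]


comments = extract_comments(SAMPLE_HTML)
print(comments)
```

A pattern like this breaks as soon as the tag has extra attributes or different quoting, which is one reason an HTML parser is usually the more robust choice.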