This is the HTML I'm using:
<div id="post_message_64012736" class=" post">
<br>
Just testing something, please ignore this :D<br>
<br>
<br>
<br>
<br>
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">
Quote:
</div>
<table cellpadding="6" cellspacing="0" border="0" width="100%">
<tbody><tr><td class="quotearea">
<div style="font-style:italic">New browser based game that was directly inspired by Candy Box, but is quite different from it.<br>
<br>
A Dark Room -</div>
</td>
</tr>
</tbody></table>
</div>
I have it running on a tab, pretty interesting. I still don't know how to get scales thought. You can only buy them or get them from the traps?<br>
<br>
Is there a Sentinel demo that doesn't require unity3d in the browser? Like a real windows demo?
</div>
This is the code I'm using, it's very simple:
soup = bs4.BeautifulSoup(r.text)
for i in soup.findAll("div", class_=" post"):
    print(i.text)
But I only get this output:
Just testing something, please ignore this :D
Quote:
New browser based game that was directly inspired by Candy Box, but is quite different from it.
A Dark Room -
If I print just i, I get this:
<div class=" post" id="post_message_64012736">
INFO:pyindiegaf<br/>
<br/>
Just testing something, please ignore this :D<br/>
<br/>
<br/>
<br/>
<br/>
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">
Quote:
</div>
<table border="0" cellpadding="6" cellspacing="0" width="100%">
<td class="quotearea">
<div style="font-style:italic">New browser based game that was directly inspired by Candy Box, but is quite different from it.<br/>
<br/>
A Dark Room -</div>
</td>
</table></div></div>
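It looks like once it finds the closing </table> tag, it simply assumes that is the end of the main div. As far as I can see, every opening tag has a matching closing tag, so it's not as if the HTML is malformed.
So... any guesses as to what might be happening here? I feel stupid, like I'm missing something obvious.
Thanks!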
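Edit: Some clarification, since the HTML on its own seems to work: I'm not actually using just that piece of HTML. I'm using this URL: http://www.neogaf.com/forum/showthread.php?t=572913&page=12
It's a vBulletin forum, so every post has a "post" class. I'm looking for them with bs4, and if a post contains a keyword I start processing it like this: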
import bs4
import requests

url = "http://www.neogaf.com/forum/showthread.php?t=572913&page=12"
r = requests.get(url)
print("Using url:", url)
# No parser is named here, so bs4 picks whichever one is installed.
soup = bs4.BeautifulSoup(r.text)
for i in soup.findAll("div", class_=" post"):
    if "INFO:pyindiegaf" in i.text:
        print(i)
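With this approach I get the result shown above: bs4 stops before the end of the whole div block.
Sorry for the confusion; I was trying to simplify things.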
Answer 0 (score: 1)
The site appears to have some malformed HTML that interferes with parsing. Install html5lib (pip install html5lib) and use it as the HTML parser:
import requests
from bs4 import BeautifulSoup

url = 'http://www.neogaf.com/forum/showthread.php?t=572913&page=12'
html = requests.get(url).content
# html5lib repairs broken markup the way a browser would
soup = BeautifulSoup(html, 'html5lib')
for post in soup.find_all('div', class_='post'):
    text = post.get_text()
    if 'INFO:pyindiegaf' in text:
        print(text)
This is the most lenient HTML parser you can get. Also note that class_='post' and class_=' post' produce different results.
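To see the class_ difference for yourself, here is a minimal sketch. The markup and the helper name ids_matching are made up for illustration and are not taken from the forum page. bs4 treats class as a multi-valued attribute and splits it on whitespace, so a tag written as class=" post" is stored with the single class "post". The search string, however, is compared as given, so a leading space in class_ can change what matches.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class attributes matter here.
SAMPLE = """
<div id="a" class=" post">first</div>
<div id="b" class="post">second</div>
<div id="c" class="post quoted">third</div>
<div id="d" class="postscript">not a post</div>
"""


def ids_matching(soup, class_value):
    """Return the ids of all <div> tags that find_all matches for class_value."""
    return [div["id"] for div in soup.find_all("div", class_=class_value)]


soup = BeautifulSoup(SAMPLE, "html.parser")

# Compare the search with and without the leading space.
for value in ("post", " post"):
    print(repr(value), "->", ids_matching(soup, value))

# A CSS selector matches on the individual class name as well.
print("select ->", [div["id"] for div in soup.select("div.post")])
```

Searching for the bare class name (or using select) is the safer habit, since it does not depend on how the attribute happens to be spaced in the page source.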
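Since you're scraping a forum, you may want to use Scrapy. It looks complicated, but a spider will be simpler and faster than your BeautifulSoup crawler (if you are in fact crawling the forum).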
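If you want to check how much the parser choice matters before switching, a rough sketch like the following may help. The broken snippet and the function name post_text are assumptions made up for this example (a stray closing div inside an unclosed table cell), not the actual markup of the forum page. Parsers that are not installed are skipped instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up broken markup: the <td> is never closed and a stray </div>
# appears inside the table, before the post's real closing tag.
BROKEN = '<div class="post">before<table><tr><td>quoted</div>after</div>'


def post_text(markup, parser):
    """Return the text of the first div.post as seen by `parser`.

    Returns None when bs4 cannot find the requested parser.
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    post = soup.find("div", class_="post")
    return post.get_text(" ", strip=True) if post else ""


for parser in ("html.parser", "lxml", "html5lib"):
    result = post_text(BROKEN, parser)
    if result is None:
        print(parser, "-> not installed")
    else:
        print(parser, "->", repr(result))
```

Each parser repairs the markup in its own way, so they may disagree about where the post ends. If the text after the quote only shows up with one of them, that is the parser to use for this site.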