I'm trying to scrape a website, and my sample html looks like this:
<div class="ism-true"><!-- message -->
<div id="post_message_5437898" data-spx-slot="1">
OK, although it's been several weeks since I installed the
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>DeltaNu1142</strong>
</div>
<div style="font-style:italic">The very first thing I did </div>
</div>
</div>When I got my grille back from the paint shop, I went to work on the
</div>
<!-- / message --></div>
<div class="ism-true"><!-- message -->
<div id="post_message_5125716">
<div style="margin:1rem; margin-top:0.3rem;">
<div><label>Quote:</label></div>
<div class="panel alt2" style="border:1px inset">
<div>
Originally Posted by <strong>HCFX2013</strong>
</div>
<div style="font-style:italic">I must be the minority that absolutely can't .</div>
</div>
</div>Hello World.
</div>
<!-- / message --></div>
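I want only the text that belongs to the post itself, not the text inside the "panel alt2" class. The position of that class inside the div id="post_message_" element keeps changing. How can I ignore the text in the panel alt2 class?
My code: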
text = []
for item in soup.findAll('div',attrs={"class":"ism-true"}):
    result = [item.get_text(strip=True, separator=" ")]
    div = item.find('div', class_="panel alt2")
    if div :
        result[0] = ' '.join(result[0].split(div.text.split()[-1])[1:])
        text.append(result[0])
    else:
        text.append(result)
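The code above only gives me the text when the "panel alt2" div comes first inside the post div. If the position of that class changes, it doesn't work well and raises a "list index out of range" error. Can you help me ignore these classes? The expected result is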
[OK, although it's been several weeks. When I got my grille back from the paint shop, I went to work on the],[Hello world]
Answer 0 (score: 1)
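One workable approach is to use extract to remove the div with class panel alt2 and the label tag. The following code appears to work on that site as well as on your sample html.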
import requests
from bs4 import BeautifulSoup

URL = 'https://www.f150forum.com/f118/fab-fours-black-steel-elite-bumper-adaptive-cruise-relocation-bracket-387234/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
text = []
for div in soup.find_all('div', class_="ism-true"):
    try:
        div.find('div', class_="panel alt2").extract()
    except AttributeError:
        pass  # sometimes there is no 'panel alt2'
    try:
        div.find('label').extract()
    except AttributeError:
        pass  # sometimes there is no 'Quote'
    text.append(div.text.strip())
print(text)
Output for your sample:
["OK, although it's been several weeks since I installed the \n\n \n\nWhen I got my grille back from the paint shop, I went to work on the", 'Hello World.']
You can remove the newline characters if you don't need them.
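One possible way to drop those newlines is to split each string on whitespace and join the pieces with single spaces. This is only a minimal sketch: the helper name collapse_whitespace is hypothetical, and the text list is copied from the sample output above rather than scraped live.

```python
# Sample output copied from the answer above (not fetched from the site).
text = [
    "OK, although it's been several weeks since I installed the \n\n \n\n"
    "When I got my grille back from the paint shop, I went to work on the",
    'Hello World.',
]


def collapse_whitespace(s):
    """Replace every run of whitespace (spaces, newlines) with one space.

    Hypothetical helper name, used only for this illustration.
    """
    # str.split() with no argument splits on any whitespace run and
    # discards empty pieces, so joining with ' ' normalizes the string.
    return " ".join(s.split())


cleaned = [collapse_whitespace(t) for t in text]
print(cleaned)
```

Because str.split() without arguments also trims leading and trailing whitespace, the separate strip() call becomes unnecessary if you apply this helper to div.text directly.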