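I need to parse some text that sits between unnamed br elements and follows a span with a certain class name. In this example I need 0.36, which comes right after the label "DS".
This is what I have tried.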
from bs4 import BeautifulSoup
html="""
<pre5 style="">
<br><br>
<span class="field-name">DS :</span>
0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br> <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br> <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br> <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br>
</pre5>
"""
soup = BeautifulSoup(html,'lxml')
divTag = soup.find_all("pre5", {"style":""})
for tag in divTag:
    tdTags = tag.find_all("span", {"class":"field-name"})
    for tag in tdTags:
        print(tag.text)
# prints "DS :", but I want 0.36
#Alternatively,
soup = BeautifulSoup(html,'lxml')
print(str(soup.span.next_sibling.strip()).replace('[null]',''))
# prints 0.36, but I would like to make sure that this value actually belongs to "DS :" and is not just the immediate next sibling. Is there a way to respect the label DS and fetch the value for it?
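Also, doing this by string parsing/splitting/replacing would be slower. Can I use the tree structure directly?
Edit: in the case below, the value of DS should be 0.007. There is no guarantee that DS will be the first element, i.e. the one in the span with the class.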
html="""
<pre5 style="">
<br><br>
<span class="field-name">FC :</span>
0.36 [null]<br><br> <br> <span> <b>DS:</b> </span><span> : 0.007 </span><br> <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br> <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br> <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br>
</pre5>
"""
Answer 0 (score: 2)
Since the text DS can be inside either a <span> or a <b> tag, and the data can also be inside a <span> tag, you can search for the tags like this:
html = """
<pre5 style="">
<br><br>
<span class="field-name">DS :</span>
0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br> <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br> <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br> <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br>
</pre5>
<pre5 style="">
<br><br>
<span class="field-name">FC :</span>
0.36 [null]<br><br> <br> <span> <b>DS:</b> </span><span> : 0.007 </span><br> <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br> <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br> <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br>
</pre5>
"""
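import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
pre_tags = soup.find_all("pre5", {"style": ""})
for pre_tag in pre_tags:
    # Match a <span> or <b> whose text is the label "DS" followed by a colon.
    # (string= is the current name of the older text= argument.)
    label_tags = pre_tag.find_all(["span", "b"], string=re.compile(r'DS\s*:'))
    for label_tag in label_tags:
        sibling = label_tag.next_sibling
        if sibling is not None and sibling.strip():
            # The value is the bare text right after the label tag.
            print(sibling.replace('[null]', '').strip())
        else:
            # The value is inside the next <span>, prefixed with a colon.
            print(label_tag.find_next("span").text.replace(':', '').strip())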
This gives you the output:
0.36
0.007
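The same idea can be wrapped in a reusable function. The following is only a sketch under a few assumptions: the helper name extract_field, the anchored pattern with an optional colon, the shortened sample HTML, and the use of the built-in "html.parser" (chosen so the example runs without lxml) are all illustrative and not part of the original answer.

```python
import re

from bs4 import BeautifulSoup, NavigableString


def extract_field(block, label):
    """Return the value that follows `label` inside `block`, or None."""
    # Anchored so that "DS" does not match a longer label; the colon is
    # optional because it sometimes lives in the value <span> instead.
    pattern = re.compile(r'^\s*' + re.escape(label) + r'\s*:?\s*$')
    label_tag = block.find(["span", "b"], string=pattern)
    if label_tag is None:
        return None
    sibling = label_tag.next_sibling
    # Layout 1: the value is a bare text node right after the label tag.
    if isinstance(sibling, NavigableString) and sibling.strip():
        return sibling.replace('[null]', '').strip()
    # Layout 2: the value sits in the next <span>, prefixed with a colon.
    value_span = label_tag.find_next("span")
    if value_span is None:
        return None
    return value_span.get_text().replace(':', '').strip()


# Shortened versions of the two layouts from the question (hypothetical sample).
HTML_ONE = """
<pre5 style="">
<br><br>
<span class="field-name">DS :</span>
0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br>
</pre5>
"""
HTML_TWO = """
<pre5 style="">
<br><br>
<span class="field-name">FC :</span>
0.36 [null]<br><br> <br> <span> <b>DS:</b> </span><span> : 0.007 </span><br>
</pre5>
"""

first = BeautifulSoup(HTML_ONE, "html.parser")
second = BeautifulSoup(HTML_TWO, "html.parser")

print(extract_field(first, "DS"))
print(extract_field(second, "DS"))
```

Because the label is matched before any sibling is read, the value is only returned when it really belongs to the requested field, and a missing label yields None instead of raising an exception.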
Answer 1 (score: 0)
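If I understand correctly, the text you want to extract comes right after the <span> tag, so you can use next_element twice. The first call gives the text inside the tag, and the second gives the text that follows it. It seems to work here, like this: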
from bs4 import BeautifulSoup
html="""
<pre5 style="">
<br><br>
<span class="field-name">DS :</span>
0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br> <br> <span> <b>FDC</b> </span><span> : 0.36 </span><br> <br> <span> <b>LDD</b> </span><span> : 4838400000 </span><br> <br> <span> <b>IFS</b> </span><span> : 0.5333333 </span><br>
</pre5>
"""
soup = BeautifulSoup(html,'lxml')
divTag = soup.find_all("pre5", {"style":""})
for pre_tag in divTag:
    tdTags = pre_tag.find_all("span", {"class":"field-name"})
    for td_tag in tdTags:
        # First next_element: the text inside the span; second: the text after it.
        print(td_tag.next_element.next_element.replace('[null]', ''))
It produces (the extra whitespace can be stripped afterwards):
0.36
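To address the concern about the value really belonging to DS, the label text can be read from the span before the following element is trusted. The snippet below is a rough sketch, not the answer's original code: the values dictionary, the shortened sample HTML, and the "html.parser" backend are assumptions made for illustration. It only covers the layout where the label is the span with class field-name, not the <b> layout from the edit to the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample based on the first snippet in the question (hypothetical).
html = """
<pre5 style="">
<br><br>
<span class="field-name">DS :</span>
0.36 [null]<br><br> <br> <span> <b>FC</b> </span><span> : 0.0 </span><br>
</pre5>
"""

soup = BeautifulSoup(html, "html.parser")

values = {}
for span in soup.find_all("span", class_="field-name"):
    # Turn "DS :" into "DS" so the value can be looked up by label.
    label = span.get_text().rstrip(" :").strip()
    # next_element of the span is its own text; the one after that is
    # whatever follows the closing </span> in document order.
    following = span.next_element.next_element
    if isinstance(following, NavigableString):
        values[label] = following.replace("[null]", "").strip()

print(values.get("DS"))
```

Looking the value up by label means a different field in the first position no longer gets mistaken for DS; values.get("DS") simply returns None in that case.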