I have some HTML that looks like this:
<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
<li><p>supervising production of wine</p></li>
<li><p>supervise wine production</p></li>
<li><p>wine production supervising</p></li>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>
What I want to do is collect all the li elements listed under <h2>Alternative label</h2>. This is the code snippet I have so far:
somehtmlContent = BeautifulSoup(somehtml.content, "lxml")
for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    try:
        altlabel = item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        litags = []
        for litag in ultag:
            litags.append(litag.findNext('p').text)
        for tag in litags:
            print(tag)
    except:
        pass
However, when I print the contents of the litags list, every entry is printed twice, like this:
managing production of wine
managing production of wine
supervising wine production
supervising wine production
wine production managing
wine production managing
supervising production of wine
supervising production of wine
supervise wine production
supervise wine production
wine production supervising
wine production supervising
managing wine production
managing wine production
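Can someone help me understand why this happens? Thanks for any help you can provide.
Answer 0 (score: 1)
With the HTML snippet from the question, your current code doesn't print anything: it ends up in the exception handler. The problem is here: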
for litag in ultag:
    litags.append(litag.findNext('p').text)
You are effectively iterating over ultag.contents, which holds all of the child tags and NavigableStrings. To fix this, iterate only over the <p> tags:
data = '''<div class="content-container">
<h2>Description</h2>
<pre>Manage the wine production and review the production pipeline and volumes.</pre>
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
<li><p>wine production managing</p></li>
<li><p>supervising production of wine</p></li>
<li><p>supervise wine production</p></li>
<li><p>wine production supervising</p></li>
<li><p>managing wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>'''
from bs4 import BeautifulSoup
somehtmlContent = BeautifulSoup(data, "lxml")
for item in somehtmlContent.find_all("div", {"class": "content-container"}):
    try:
        altlabel = item.find(text="Alternative label")
        h2tag = altlabel.parent
        ultag = h2tag.findNext('ul')
        litags = []
        for p in ultag.find_all('p'):
            litags.append(p.text)
        for tag in litags:
            print(tag)
    except:
        pass
Prints:
managing production of wine
supervising wine production
wine production managing
supervising production of wine
supervise wine production
wine production supervising
managing wine production
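EDIT: A shorter way to get the content is soup.select('h2:contains("Alternative label") + ul p'). This selects the <h2> containing "Alternative label", then the <ul> immediately following it, and all <p> tags inside that <ul>.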
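As a minimal sketch of that selector in use (not necessarily the exact code from the edit): the HTML below is a shortened stand-in for the question's snippet, the names html and labels are illustrative, and html.parser is used only so that lxml is not required. The pseudo-class is spelled :-soup-contains here, which is the name recent Soup Sieve releases use for :contains; that assumes a reasonably new install, and on older ones :contains is the spelling.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (hypothetical sample data).
html = """<div class="content-container">
<h2>Alternative label</h2>
<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
</ul>
<h2>Skill type</h2>
<ul>
<li><p>skill</p></li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# h2 containing the text, then the <ul> that immediately follows it ("+"),
# then every <p> descendant of that <ul>.
selector = 'h2:-soup-contains("Alternative label") + ul p'
labels = [p.get_text() for p in soup.select(selector)]

for label in labels:
    print(label)
```

Because the selector only matches the <ul> directly after the matching <h2>, the list under "Skill type" is left out.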
Answer 1 (score: 1)
There is a small mistake in for litag in ultag. Use for litag in ultag.find_all('li') instead. With for litag in ultag, the loop also visits the blank lines between the list items. For each blank line, the next <p> tag is appended. That is why the entries are duplicated.
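The effect of those blank lines can be seen in a small sketch. The HTML is a cut-down, hypothetical version of the question's list with one extra paragraph added after it, the names kinds, buggy and fixed are illustrative, and html.parser is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Cut-down sample: two list items plus a paragraph after the list.
html = """<ul>
<li><p>managing production of wine</p></li>
<li><p>supervising wine production</p></li>
</ul>
<p>unrelated paragraph</p>"""

ultag = BeautifulSoup(html, "html.parser").find("ul")

# Iterating over a Tag walks its direct children, including the "\n"
# strings that sit between the <li> elements.
kinds = [type(child).__name__ for child in ultag]
print(kinds)

# Every child, string or tag, answers find_next("p") with the next <p> in
# document order, so each whitespace string adds an extra entry. The last
# "\n" even reaches past the list to the unrelated paragraph.
buggy = [child.find_next("p").text for child in ultag]
print(buggy)

# Restricting the loop to <li> tags skips the whitespace strings.
fixed = [li.find("p").text for li in ultag.find_all("li")]
print(fixed)
```

If no <p> follows the list, find_next returns None for the trailing newline and .text raises AttributeError, which is consistent with the question's snippet falling into its except branch and printing nothing.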