我有以下html代码:
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8/div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div><a href="https://google.com">Text10</a></div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>
如何找到Text3
元素后紧跟div
字符串的MyText
?
答案 0 :(得分:1)
您可以使用lxml.html
解决方案:
from lxml import html
source = """
<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
...
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
tree = html.fromstring(source)
print(tree.xpath('//div[.="MyText"]/following-sibling::span/div/span/text()'))
答案 1 :(得分:0)
只有当您的结构是最终结构时,您才能做到这一点。
from bs4 import BeautifulSoup as bfs
html = """<div class="xyOfqd">
<div class="aAAD">
<div class="Bgbcca">Updated</div>
<span class="hthtb">
<div>
<span class="hthtb">September 30, 2018</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text1</div>
<span class="hthtb">
<div><span class="hthtb">Text2</span></div>
</span>
</div>
<div
class="aAAD">
<div class="Bgbcca">MyText</div>
<span class="hthtb">
<div>
<span class="hthtb">Text3</span>
</div>
</span>
</div>
<div class="aAAD">
<div class="Bgbcca">Text4</div>
<span class="hthtb">
<div><span
class="hthtb">Text5</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text6</div>
<span class="hthtb">
<div><span
class="hthtb">Text7</span></div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">
Text8/div>
<span class="hthtb">
<div>
<span class="hthtb">
<div>Text9</div>
<div><a href="https://google.com">Text10</a></div>
</span>
</div>
</span>
</div>
<div class="aAAD">
<div
class="Bgbcca">Text11</div>
<span class="hthtb">
<div><span class="hthtb">Text12</span></div>
</span>
</div>"""
soup = bfs(html, 'html.parser')
result = ''
for div0 in soup.find_all('div',{'class':'aAAD'}):
for div1 in div0.find_all('div', {'class':'Bgbcca'}):
if div1.get_text() == 'MyText':
span = div0.find('span',{'class':'hthtb'})
if span:
span_to_return = span.find('span',{'class':'hthtb'})
if span_to_return:
result = span_to_return.get_text()
print(result)
答案 2 :(得分:0)
您可以构建自定义查询函数以传递到find()
:
def has_my_text(tag):
found = tag.select_one('.Bgbcca')
# important to assign the result to avoid calling
# .get_text() on a NoneType, resulting in an error.
if found:
return found.get_text() == "MyText"
soup = bs4.... # assign your soup object
found = soup.find(has_my_text)
# <div class="Bgbcca">MyText</div>
# <span class="hthtb">
# <div>
# <span class="hthtb">Text3</span>
# </div>
# </span>
# </div>
# Note your span class is nested so we go two level in
result = found.select_one('.hthtb').select_one('.hthtb').get_text()
# 'Text3'
# This below also works if your other span are always empty texts
result = found.select_one('.hthtb').get_text().strip()
请注意,find()
和select_one
假设我们只需要找到第一个匹配项。如果需要处理多个匹配项,则需要使用find_all()
和select()
并相应地更改代码。
如果要处理可变文本,可以这样定义函数:
def has_my_text(tag, text):
found = tag.select_one('.Bgbcca')
if found:
return found.get_text() == text
将函数包装在find()
中,如下所示:
txt = "MyText"
soup.find(lambda tag: has_my_text(tag, txt))