html = """
...
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
...
"""
I want to get all the text between the first occurrence of the <big> tag and the opening <a> tag. For this example, that means I should get (iterable) as a string.
Answer 0 (score: 5)
An iterative approach.
from BeautifulSoup import BeautifulSoup as bs
from itertools import takewhile, chain
def get_text(html, from_tag, until_tag):
    soup = bs(html)
    for big in soup(from_tag):
        # The first until_tag that follows this from_tag marks where to stop.
        until = big.findNext(until_tag)
        # Keep only the following siblings that carry non-blank text.
        strings = (node for node in big.nextSiblingGenerator() if getattr(node, 'text', '').strip())
        selected = takewhile(lambda node: node != until, strings)
        try:
            # next(selected) raises StopIteration when nothing lies between the two tags.
            yield ''.join(getattr(node, 'text', '') for node in chain([big, next(selected)], selected))
        except StopIteration:
            pass
for text in get_text(html, 'big', 'a'):
    print text
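For readers on Python 3, where the old BeautifulSoup 3 package is not available, the same sibling-based idea can be sketched with bs4. This is an adaptation, not the answerer's code: it assumes bs4 is installed and uses the built-in html.parser backend, and the names get_text_bs4 and sample are made up for illustration.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag

# Sample markup mirroring the question (hypothetical variable name).
sample = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
"""


def get_text_bs4(html, from_tag, until_tag):
    """Yield the text from each from_tag up to the next until_tag."""
    soup = BeautifulSoup(html, "html.parser")
    for start in soup.find_all(from_tag):
        until = start.find_next(until_tag)
        # Keep only sibling tags that carry visible text.
        siblings = (node for node in start.next_siblings
                    if isinstance(node, Tag) and node.get_text(strip=True))
        selected = list(takewhile(lambda node: node is not until, siblings))
        # Skip a start tag that is immediately followed by until_tag.
        if selected:
            yield "".join(node.get_text() for node in [start] + selected)


for text in get_text_bs4(sample, "big", "a"):
    print(text)
```

Like the original, this only looks at siblings of the starting tag, so it will miss text when the stopping tag sits at a different nesting level.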
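Answer 1 (score: 4)
I would avoid nextSibling, because from your question it sounds like you want to include everything up to the next <a>, whether it lives in a sibling, a parent, or a child element.
So I think the best approach is to find the node for the next <a> element, then walk towards it recursively, adding every string you meet along the way. If your HTML differs a lot from the sample you may need to tidy up the code below, but something like this should work: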
from bs4 import BeautifulSoup, NavigableString
# Parse the `html` variable from the question.
soup = BeautifulSoup(html, 'html.parser')
firstBigTag = soup.find('big')
nextATag = firstBigTag.find_next('a')
def loopUntilA(text, element):
    # Stop once the walk reaches the target <a> (or runs off the end of the document).
    if element is None or element is nextATag:
        return text
    # Only string nodes contribute text; strip() drops the whitespace between tags.
    if isinstance(element, NavigableString):
        text += element.strip()
    # next_element visits every node in document order, tags and strings alike.
    return loopUntilA(text, element.next_element)
targetString = loopUntilA('', firstBigTag)
print(targetString)
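The walk towards the next <a> can also be written as a flat loop instead of a recursion, which sidesteps Python's recursion limit on large documents. The sketch below is one possible variant, not the answerer's code: it assumes bs4 is installed, relies on the next_elements generator, and the function name text_between and the one-line sample are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def text_between(html, from_tag, until_tag):
    """Join every string from the first from_tag up to the next until_tag."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(from_tag)
    if start is None:
        return ""
    stop = start.find_next(until_tag)
    parts = []
    # next_elements yields tags and strings in document order,
    # beginning with the first child of the start tag.
    for node in start.next_elements:
        if node is stop:
            break
        if isinstance(node, NavigableString):
            parts.append(node.strip())
    return "".join(parts)


# Shortened, hypothetical version of the markup in the question.
sample = ("<tt>all</tt>\n<big>(</big>\n<em>iterable</em>\n"
          "<big>)</big>\n<a href='#all'>x</a>")
print(text_between(sample, "big", "a"))
```

Because every string node is stripped before joining, whitespace between tags does not leak into the result; drop the strip() call if the original spacing matters.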
Answer 2 (score: 1)
from BeautifulSoup import BeautifulSoup
html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="test" title="Permalink to this definition"></a>
"""
soup = BeautifulSoup(html)
print soup.find('big').nextSibling.next.text
For more information, read up on DOM traversal with BeautifulSoup.
Answer 3 (score: 0)
>>> from BeautifulSoup import BeautifulSoup as bs
>>> parsed = bs(html)
>>> txt = []
>>> for i in parsed.findAll('big'):
...     txt.append(i.text)
...     if i.nextSibling.name != u'a':
...         txt.append(i.nextSibling.text)
...
>>> ''.join(txt)
u'(iterable)'
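If installing BeautifulSoup is not an option, the expected result can be cross-checked with only the standard library. The following is a rough sketch built on Python 3's html.parser.HTMLParser, a different technique from the answers above: it tracks a small state machine while the parser streams through the markup. The class name BetweenTags, the helper text_between_stdlib, and the sample string are all hypothetical.

```python
from html.parser import HTMLParser


class BetweenTags(HTMLParser):
    """Collect text from the first from_tag until the next until_tag."""

    def __init__(self, from_tag, until_tag):
        super().__init__()
        self.from_tag = from_tag
        self.until_tag = until_tag
        self.state = "before"  # "before" -> "collecting" -> "done"
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.state == "before" and tag == self.from_tag:
            self.state = "collecting"
        elif self.state == "collecting" and tag == self.until_tag:
            self.state = "done"

    def handle_data(self, data):
        # Text is kept only between the start tag and the stop tag.
        if self.state == "collecting":
            self.parts.append(data.strip())


def text_between_stdlib(html, from_tag, until_tag):
    parser = BetweenTags(from_tag, until_tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Shortened, hypothetical version of the markup in the question.
sample = ("<tt>all</tt>\n<big>(</big>\n<em>iterable</em>\n"
          "<big>)</big>\n<a href='#all'>x</a>")
print(text_between_stdlib(sample, "big", "a"))
```

HTMLParser reports tag names in lower case, so from_tag and until_tag should be passed in lower case as well.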