BeautifulSoup: extract the strings between <br> tags, including the <b>string</b> elements

Date: 2015-10-25 08:57:19

Tags: beautifulsoup

<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>

I am trying to get each whole line, like this:

Soccer : <b>11</b>

So far I have been trying this code:

for br in body.findAll('br'):
    following = br.nextSibling
    print following.strip()

But it only produces:

Soccer:
Volley Ball:
Basketball:
Tennis:

1 answer:

Answer 0 (score: 1)

You can solve this either with an approach similar to the one you have already started, or with a regular expression.

Using BeautifulSoup:

from bs4 import BeautifulSoup


html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""

body = BeautifulSoup(html, 'lxml')

between_br = []
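for br in body.find_all('br'):
    # The node right after each <br> is the text label, e.g. "\nSoccer:\n"
    following = br.next_sibling

    # The last <br> is followed only by whitespace (or nothing), so skip it
    if following is None or not following.strip():
        continue

    sport = following.strip()
    # The element after the text node is the <b> tag that holds the score
    score = str(following.next_element)

    combined = ' '.join((sport, score))
    between_br.append(combined)

print('\n'.join(between_br))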

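Using a regular expression:

import re


html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""

sports_regex = re.compile(r"""
 (?!<br>)  # Do not start a match at a <br> tag
 (.*       # Label: any characters except a newline
 :\s       # A colon followed by one whitespace character (here, the newline)
 .*)       # Rest of the match: the <b> tag on the next line
""", re.VERBOSE)

sports = sports_regex.findall(html)
print('\n'.join([s.replace('\n', ' ') for s in sports]))

Both approaches print:

Soccer: <b>11</b>
Volley Ball: <b>5</b>
Basketball: <b>5</b>
Tennis: <b>2</b>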
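
As a variation on the BeautifulSoup approach, here is a minimal sketch that collects every node between consecutive <br> tags instead of assuming exactly one text node followed by one <b> tag. The helper name lines_between_br and the shortened sample markup are illustrative assumptions, not part of the original answer, and the sketch uses Python's built-in html.parser backend rather than lxml.

```python
from bs4 import BeautifulSoup, Tag

# Shortened sample markup (assumption: same shape as the question's HTML).
HTML = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
"""


def lines_between_br(markup):
    """Return one string per run of nodes found between consecutive <br> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for br in soup.find_all("br"):
        parts = []
        for node in br.next_siblings:
            # Stop at the next <br>; it starts the following line.
            if isinstance(node, Tag) and node.name == "br":
                break
            # str() keeps the markup of tags, e.g. "<b>11</b>".
            text = str(node).strip()
            if text:
                parts.append(text)
        # A <br> followed only by whitespace contributes nothing.
        if parts:
            lines.append(" ".join(parts))
    return lines


print("\n".join(lines_between_br(HTML)))
```

Because the inner loop walks all siblings up to the next <br>, this should also cope with lines that hold more than one tag, at the cost of being slightly longer than the next_element version.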