<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
I'm trying to get each whole line, like this:
Soccer : <b>11</b>
So far, I've been trying this code:
for br in body.findAll('br'):
    following = br.nextSibling
    print(following.strip())
But it only produces:
Soccer:
Volley Ball:
Basketball:
Tennis:
Answer 0 (score: 1)
You can solve this either with an approach similar to the one you've already started, or with a regular expression.
Option #1
from bs4 import BeautifulSoup, NavigableString
html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""
body = BeautifulSoup(html, 'lxml')
between_br = []
for br in body.find_all('br'):
    following = br.next_sibling
    # Skip any <br> that is not followed by a text label (e.g. the last one)
    if not isinstance(following, NavigableString) or not following.strip():
        continue
    sport = following.strip()
    # The element right after the text node is the <b> tag holding the number
    score = str(following.next_element)
    combined = ' '.join((sport, score))
    between_br.append(combined)
print('\n'.join(between_br))
Option #2
import re
html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""
sports_regex = re.compile(r"""
    (?!<br>)   # Do not start a match at a <br> tag
    (.*        # Match the label (any characters on the line)
    :\s        # Match a colon followed by a whitespace (here, the newline)
    .*)        # Match the rest of the next line (the <b> tag)
    """, re.VERBOSE)
sports = sports_regex.findall(html)
print('\n'.join([s.replace('\n', ' ') for s in sports]))
Both approaches print:
Soccer: <b>11</b>
Volley Ball: <b>5</b>
Basketball: <b>5</b>
Tennis: <b>2</b>
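If installing BeautifulSoup or lxml is not an option, the same pairing of each label with the `<b>` element that follows it can be sketched with the standard library's `html.parser.HTMLParser`. This is only an illustrative alternative, not part of the approaches above: the class name `SportLineParser` is hypothetical, and the sketch assumes every label is followed by exactly one plain `<b>` element, which it rebuilds from its text rather than preserving any attributes.

```python
from html.parser import HTMLParser


class SportLineParser(HTMLParser):
    """Hypothetical helper: collect 'label <b>value</b>' lines."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished lines, one per label/<b> pair
        self._label = None   # most recent non-empty text seen outside <b>
        self._in_b = False   # True while the parser is inside a <b> element

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self._in_b = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self._in_b = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace-only text between tags
        if self._in_b and self._label is not None:
            # Text inside <b>: pair it with the pending label
            self.lines.append('{} <b>{}</b>'.format(self._label, text))
            self._label = None
        elif not self._in_b:
            # Text outside <b>: remember it as the next label
            self._label = text


html = """
<br>
Soccer:
<b>11</b>
<br>
Volley Ball:
<b>5</b>
<br>
Basketball:
<b>5</b>
<br>
Tennis:
<b>2</b>
<br>
"""

parser = SportLineParser()
parser.feed(html)
parser.close()
lines = parser.lines
print('\n'.join(lines))
```

Because the `<b>` tag is reconstructed from its text, attributes or nested markup inside it would be lost; for that case the BeautifulSoup approach, which keeps the original element, is the better fit.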