In one part of the page I am trying to scrape (https://codeforces.com/contest/1352/problem/D), there is this markup:
<p>
Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
<span class="tex-font-style-bf">from left to right</span>
, and Bob —
<span class="tex-font-style-bf">from right to left</span>
. The game ends if all the candies are eaten.
</p>
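I want to get the text inside the 'p' tag. To do that, I used this Python code:
source_code = raw_html.find('p') # Find the first 'p' tag in the parsed HTML
text = source_code.get_text('\n') # Get all the text inside the 'p' tag, joining text nodes with '\n'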
text = text.replace(" ", " ") # Replacing the tabs with single white space
print(text)
(I am using BeautifulSoup4.)
This is the result I expect:
Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy from left to right, and Bob — from right to left. The game ends if all the candies are eaten.
But my output ends up looking like this:
Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
from left to right
, and Bob —
from right to left
. The game ends if all the candies are eaten.
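As far as I can tell, the problem is caused by the 'span' tags inside the 'p' tag. How can I format the output correctly? More precisely, how do I get rid of the line breaks caused by the span tags?
Answer 0 (score: 1)
This is not the most elegant approach, but you can get there with a list comprehension and some string manipulation: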
final_text = ' '.join([item for item in source_code.text.replace('\n','').split(' ') if len(item)>0])
print(final_text)
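As a variation on the same idea, here is a minimal sketch that collapses every run of whitespace and then removes the space left in front of punctuation. The helper name `normalize_whitespace` is made up for illustration, and the sample input is simply the broken output from the question; the punctuation rule is an assumption that suits this English paragraph and may not suit other text.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs and drop spaces left before punctuation.

    Hypothetical helper, not part of BeautifulSoup.
    """
    # str.split() with no argument splits on any run of spaces, tabs or newlines.
    collapsed = " ".join(text.split())
    # A line break before ", and Bob" would otherwise become "right , and Bob".
    return re.sub(r"\s+([,.;:!?])", r"\1", collapsed)


# The broken output from the question, used here as sample input.
broken = (
    "Alice and Bob play an interesting and tasty game: they eat candy. "
    "Alice will eat candy\n"
    "from left to right\n"
    ", and Bob —\n"
    "from right to left\n"
    ". The game ends if all the candies are eaten."
)

fixed = normalize_whitespace(broken)
print(fixed)
```

With BeautifulSoup this would be called as `normalize_whitespace(source_code.get_text())`. Note that the first argument of `get_text()` is the separator placed between text nodes, so passing `'\n'` as in the question adds a line break at every span boundary; calling it with no argument avoids adding those.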