Correctly formatting scraped web page text in Python

Date: 2020-05-22 01:47:27

Tags: python html web-scraping beautifulsoup

One section of the site I am scraping (https://codeforces.com/contest/1352/problem/D) contains the following markup:

<p>
Alice and Bob play an interesting and tasty game: they eat candy.       Alice will eat candy 
    <span class="tex-font-style-bf">from left to       right</span>
, and Bob — 
    <span class="tex-font-style-bf">from right         to left</span>
. The game ends if all the candies are eaten.
</p>

I want to get the text inside the 'p' tag. To do that, I used this Python code:

source_code = raw_html.find('p')  # Find the first 'p' tag in the parsed HTML
text = source_code.get_text('\n')  # Get the text of the 'p' tag, joining its pieces with newlines
text = text.replace("    ", " ")  # Replace each run of four spaces with a single space
print(text)

(I am using BeautifulSoup4.)

This is the result I expect:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy from left to right, and Bob — from right to left. The game ends if all the candies are eaten.

But my output ends up looking like this:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
from left to right
, and Bob —
from right to left
. The game ends if all the candies are eaten.

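As far as I can tell, the problem is caused by the 'span' tags inside the 'p' tag. How can I format the text correctly? More precisely, how do I get rid of the line breaks that appear around the span tags?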

1 answer:

Answer 0: (score: 1)

This is not the most elegant approach, but you can get there with a list comprehension and some text manipulation:

final_text = ' '.join([item for item in source_code.text.replace('\n', '').split(' ') if len(item) > 0])
print(final_text)
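
As a possible alternative, here is a minimal sketch that uses only the standard library and works on the string you get back from `get_text()` called without a separator. The helper name `normalize_whitespace` and the `raw_text` sample are assumptions made for illustration; `raw_text` is meant to approximate what `source_code.get_text()` would return for the 'p' tag in the question. The idea is to collapse every run of whitespace (spaces, tabs, newlines) into a single space, then remove the space left in front of punctuation.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs into single spaces and tidy punctuation."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and never yields empty strings.
    collapsed = " ".join(text.split())
    # Drop the stray space left before punctuation, e.g. "right , and".
    return re.sub(r"\s+([,.;:!?])", r"\1", collapsed)


# Hypothetical stand-in for source_code.get_text() on the <p> tag above.
raw_text = (
    "\nAlice and Bob play an interesting and tasty game: they eat candy."
    "       Alice will eat candy \n    from left to       right\n, and Bob — \n"
    "    from right         to left\n. The game ends if all the candies are eaten.\n"
)

print(normalize_whitespace(raw_text))
```

With BeautifulSoup this would be called as `normalize_whitespace(source_code.get_text())`. Passing `'\n'` as the separator, as in the question, is what inserts the extra line breaks between the text pieces, so it is left out here.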