Question

On a site I want to scrape (https://codeforces.com/contest/1352/problem/D), one part of the page contains this markup:

<p>
Alice and Bob play an interesting and tasty game: they eat candy.       Alice will eat candy 
    <span class="tex-font-style-bf">from left to       right</span>
, and Bob — 
    <span class="tex-font-style-bf">from right         to left</span>
. The game ends if all the candies are eaten.
</p>

I want to get the text inside the 'p' tag. To do that, I used this Python code:

source_code = raw_html.find('p')  # Find the first 'p' tag in the parsed HTML
text = source_code.get_text('\n')  # Get all text inside the 'p' tag, joining the pieces with '\n'
text = text.replace("    ", " ")  # Replace each run of four spaces with a single space
print(text)

(I am using BeautifulSoup4.)

This is the result I expected:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy from left to right, and Bob — from right to left. The game ends if all the candies are eaten.

But my output ends up looking like this:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
from left to right
, and Bob —
from right to left
. The game ends if all the candies are eaten.

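As far as I can tell, the problem is caused by the 'span' tags inside the 'p' tag. How do I format the output correctly? More precisely, how do I get rid of the line breaks caused by the span tags?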

Answer 1

This isn't the most elegant approach, but you can get there with a list comprehension and some string manipulation:

final_text = ' '.join([item for item in source_code.text.replace('\n', '').split(' ') if len(item) > 0])
print(final_text)
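
A possibly more robust variant, offered only as a sketch: the one-liner above depends on each line of the HTML ending in a space before its newline, otherwise neighbouring words would be glued together. Calling `str.split()` with no argument splits on any run of whitespace, so newlines and repeated spaces are handled in one step, and a small regex then removes the space left in front of punctuation. The helper name `normalize_whitespace` and the `sample` string are assumptions made for illustration. `sample` stands in for what `source_code.get_text()` (called without a separator) should return for the paragraph in the question, so the snippet runs without BeautifulSoup installed.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs and drop spaces left before punctuation."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and discards empty strings.
    collapsed = " ".join(text.split())
    # A line break before "," or "." in the HTML turns into a stray space; remove it.
    return re.sub(r"\s+([,.;:!?])", r"\1", collapsed)


# Hypothetical stand-in for source_code.get_text() on the <p> from the question.
sample = (
    "\nAlice and Bob play an interesting and tasty game: they eat candy.       Alice will eat candy \n"
    "    from left to       right\n"
    ", and Bob — \n"
    "    from right         to left\n"
    ". The game ends if all the candies are eaten.\n"
)

print(normalize_whitespace(sample))
```

In the original script this would be called as `normalize_whitespace(source_code.get_text())`. Note that passing `'\n'` as the separator to `get_text` is what adds a line break around each span in the first place, so it is left out here.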
