How to remove redundant whitespace from BeautifulSoup output

Time: 2016-10-22 10:10:01

Tags: beautifulsoup space plaintext

I am planning to scrape a website with BeautifulSoup. I am working with the following HTML:

html = """
<div id="article-body" itemprop="articleBody">
<p>
    <span class="quote down bgQuote" data-channel="/quotes/zigman/170324/composite" data-bgformat="">
    <a class="qt-chip trackable" data-fancyid="XNYSStockSLB" href="/investing/stock/slb?mod=MW_story_quote" data-track-mod="MW_story_quote">
    SLB,
    <span class="bgPercentChange">-3.04%</span>
    </a>
    </span>
    reported late Thursday
    <a href="/story/schlumberger-profit-falls-sharply-2016-10-20-174854654" class="icon none">higher third-quarter profit that beat targets and sales only slightly below estimates</a>
    . Schlumberger’s results came a day after rival Halliburton Co.
    <span class="quote down bgQuote" data-channel="/quotes/zigman/228631/composite" data-bgformat="">
    <a class="qt-chip trackable" data-fancyid="XNYSStockHAL" href="/investing/stock/hal?mod=MW_story_quote" data-track-mod="MW_story_quote">
    HAL,
    <span class="bgPercentChange">-0.66%</span>
    </a> """

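I want plain text without any extra whitespace. I followed Twig's answer, but SLB and -3.04%, as well as HAL and -0.66%, still end up on separate lines. My desired output would look like this: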

 SLB, -3.04%  reported late Thursday higher third-quarter profit that beat targets and sales only slightly below estimates. Schlumberger’s results came a day after rival Halliburton Co. HAL, -0.66%  also posted higher-than-expected profit.

Here is my code:

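from bs4 import BeautifulSoup
import re

# Parse the HTML string defined above
newsText = BeautifulSoup(html, "html.parser")
text = list(newsText.stripped_strings)
finalText = "\n\n".join(text) if text else ""
# The return value of re.sub is not assigned, so finalText is unchanged
re.sub(r'[\ \n]{2,}', '', finalText)
print(finalText)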

Any help would be much appreciated.

1 answer:

Answer 0: (score: 2)

Out:

SLB,  -3.04%  reported late Thursday  higher third-quarter profit that beat targets and sales only slightly below estimates  . Schlumberger’s results came a day after rival Halliburton Co.  HAL,  -0.66%
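The code that produced the output above is not shown here. As a minimal sketch of one way to get single-line text of this kind, the fragments from `stripped_strings` can be joined with a single space instead of `"\n\n"` and then cleaned with `re.sub`, keeping the returned string. The helper names `join_fragments` and `html_to_plain_text` are hypothetical, and the punctuation cleanup is only a heuristic, not necessarily what the original answer did.

```python
import re


def join_fragments(fragments):
    """Join text fragments into one line separated by single spaces."""
    # Join with a single space instead of "\n\n" so everything stays on one line.
    text = " ".join(fragments)
    # Collapse any whitespace runs, including newlines inside a fragment.
    text = re.sub(r"\s+", " ", text).strip()
    # Heuristic: drop the space left in front of punctuation such as " ."
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


def html_to_plain_text(markup):
    """Extract single-line plain text from an HTML string (hypothetical helper)."""
    # Imported here so join_fragments stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, "html.parser")
    # stripped_strings yields each text node with surrounding whitespace removed.
    return join_fragments(soup.stripped_strings)


print(join_fragments(["SLB,", "-3.04%", "reported late Thursday"]))
# SLB, -3.04% reported late Thursday
```

Calling `html_to_plain_text(html)` on the HTML from the question should then return everything on one line. If the punctuation cleanup is not needed, `soup.get_text(" ", strip=True)` is a shorter built-in alternative that also joins the stripped strings with a space.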
