Extracting the text from <p> </p> with BeautifulSoup

Time: 2017-08-15 17:46:42

Tags: python-3.x beautifulsoup

I am trying to get a news article from this link. My code is:

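import requests
from bs4 import BeautifulSoup

def get_news_details(news_url):
    # Download the page and parse it with the built-in HTML parser
    source = requests.get(news_url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # The article body lives in the div with class "big-img-box"
    content = soup.findAll('div', {'class' : 'big-img-box'})
    print(content[0].findAll('p'))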

The result is:

[<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>]

The value of content is:

<div class="big-img-box">
<div class="left-imgs">
<figure>
<img alt="iOS developer hints possibility of 4K Apple TV" class="img-responsive" src="http://www.aninews.in/contentimages/detail/appletv.jpg"/>
<figcaption><span class="heading-inner-span"></span></figcaption>
</figure>
<div class="mb10"></div>
</div>
<p></p>      New York [USA], August 6 <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a>: The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/4k-apple-tv.html"> 4K Apple TV</a></span> with high dynamic range (HDR)  support for both <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr10.html"> HDR10  </a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/dolby-vision.html"> Dolby Vision</a></span>.<p></p>      While the current range of Apple's TV set-top box is incompatible to 4K technology, <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/ios.html">iOS</a></span> developer <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/guilherme-rambo.html"> Guilherme Rambo</a></span> revealed that the company is hinting an adoption of the ultra high-definition format, reports <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/the-verge.html">The Verge</a></span>.<p></p>      Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.<p></p>      It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/netflix.html"> Netflix</a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/amazon.html"> Amazon</a></span> support the two high-definition formats.<p></p>      Last month, iTunes started listing movies as supporting 4K and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr.html"> HDR</a></span> in users' purchase histories, thus providing more thrust to 
the speculations of the 4K <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/apple.html"> Apple</a></span> TV. <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a><p></p>
</div>

I can get a somewhat clumsy version of the article with content[0].text, but I cannot format it.

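When I inspect the page in Chrome, the article appears to be written inside <p>article_text</p> tags. In content, however, it shows up as <p></p>article_text, so every <p> tag is empty and the text sits next to it. If the former version were present in soup, I could get the output I want. What should I do?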
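To make the situation concrete, here is a minimal sketch that reproduces it without any network access. The HTML fragment is a shortened, made-up stand-in shaped like the content shown above, and `split_on_empty_p` is a hypothetical helper, not part of BeautifulSoup. It treats each empty <p> as a paragraph boundary and gathers the sibling text between boundaries.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened fragment shaped like the content shown above.
html = (
    '<div class="big-img-box">'
    '<div class="left-imgs"><figure><img alt="x" src="x.jpg"/></figure></div>'
    '<p></p>      First paragraph with a <a href="#">link</a>.'
    '<p></p>      Second paragraph.<p></p>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", {"class": "big-img-box"})


def split_on_empty_p(container):
    """Group the text found between empty <p> tags into paragraphs."""
    groups, current = [], []
    for node in container.children:
        is_tag = isinstance(node, Tag)
        if is_tag and node.name == "p" and not node.get_text(strip=True):
            # An empty <p> marks a paragraph boundary.
            groups.append("".join(current))
            current = []
        else:
            # Tags contribute their text; plain strings are used as they are.
            current.append(node.get_text() if is_tag else str(node))
    groups.append("".join(current))
    # Collapse runs of whitespace and drop groups that end up empty.
    cleaned = [" ".join(group.split()) for group in groups]
    return [group for group in cleaned if group]


paragraphs = split_on_empty_p(box)
for paragraph in paragraphs:
    print(paragraph)
```

This only relies on the empty <p> tags acting as separators, which is an assumption about this particular page rather than a general rule.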

1 answer:

Answer 0 (score: 2)

It depends on what you mean by formatting. You can make the text tidier in a fairly simple way.

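First fetch the page and select the article container; then get all of its text and strip the surrounding whitespace.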

>>> import bs4
>>> import requests
>>> page = requests.get('http://www.aninews.in/newsdetail-Nw/MzI4NDIy/ios-developer-hints-possibility-of-4k-apple-tv.html').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> big_img_box = soup.select('.big-img-box')

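The call below strips the leading and trailing whitespace. Going beyond this, you can also remove the longer runs of internal whitespace.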

>>> big_img_box[0].text.strip()
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a  4K Apple TV with high dynamic range (HDR)  support for both  HDR10   and  Dolby Vision.      While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer  Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge.      Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.      It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like  Netflix and  Amazon support the two high-definition formats.      Last month, iTunes started listing movies as supporting 4K and  HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K  Apple TV. (ANI)"
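
Removing the longer runs of internal whitespace could look like the sketch below. It uses only the standard `re` module; the sample string is a shortened copy of the output above, and the variable names are illustrative. Splitting on runs of four or more whitespace characters to recover paragraphs is an assumption based on how this particular output is spaced, not something the page guarantees.

```python
import re

# Shortened copy of the stripped text shown above (illustrative sample).
text = (
    "support for both  HDR10   and  Dolby Vision.      "
    "While the current range of Apple's TV set-top box is "
    "incompatible to 4K technology"
)

# Collapse every run of two or more whitespace characters into one space.
collapsed = re.sub(r"\s{2,}", " ", text)
print(collapsed)

# Assumption: paragraph breaks show up as runs of four or more whitespace
# characters, while in-sentence gaps are shorter.
paragraphs = [
    re.sub(r"\s+", " ", part).strip()
    for part in re.split(r"\s{4,}", text)
]
for paragraph in paragraphs:
    print(paragraph)
```

With the real page, `text` would be `big_img_box[0].text.strip()` from the session above.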