Why does BeautifulSoup drop so much content from this webpage?

Asked: 2018-08-10 16:04:41

Tags: beautifulsoup

I am new to BeautifulSoup and am trying to read the contents of a webpage into Python. For several pages this works fine. For this particular one, however, BeautifulSoup throws away a lot of text that I need for further processing. Here is the example:

In [101]: from bs4 import BeautifulSoup

In [102]: import requests

In [103]: url = 'http://www.reuters.com/article/companyNewsAndPR/idUSTP13157220070102'

In [104]: html = requests.get(url).text

In [105]: soup = BeautifulSoup(html, features='xml')

In [106]: soup
Out[106]: 
<?xml version="1.0" encoding="utf-8"?>
<!--[if !IE]> This has been served from cache <![endif]--><!--[if !IE]> Request served from apache server: produs--i-0c9856522bc1925a7 <![endif]--><!--[if !IE]> Cached on Fri, 10 Aug 2018 13:08:25 GMT and will expire on Fri, 10 Aug 2018 13:23:24 GMT <![endif]--><!--[if !IE]> token: 8ba1c2fc-8894-48ea-ab7f-30d75c745528 <![endif]--><!--[if !IE]> App Server /produs--i-08940b2d65953b646/ <![endif]-->

A lot of the text is dropped. The soup object contains far less than the raw html, in particular the main article text, which is still present in the html. Afterwards I want to read all the <p> tags via

  text = list(soup.find_all('p'))

but this gives me an empty list, because BeautifulSoup has dropped all of those parts. How can I fix this?
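To confirm that the text disappears during parsing rather than during the download, one can compare the size of the raw page with the size of the parsed tree. A minimal sketch, assuming the html and soup objects from the session above (the counts in the comments are illustrative, not from the original post):

print(len(html))       # raw page from requests: many thousands of characters
print(len(str(soup)))  # parsed with features='xml': only a small fragment survives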

1 answer:

Answer 0 (score: 1)

You need to parse the page as HTML rather than XML (note 'lxml' below in place of features='xml') and then select the right elements, in this case div.StandardArticleBody_body > p:

from bs4 import BeautifulSoup
import requests

# Fetch the article and parse it with lxml's lenient HTML parser
# instead of the strict XML one.
r = requests.get('http://www.reuters.com/article/companyNewsAndPR/idUSTP13157220070102')
soup = BeautifulSoup(r.text, 'lxml')

# Print the headline, a separator, then every paragraph of the article body.
print(soup.h1.text)
print('-' * 80)
print()
for p in soup.select('div.StandardArticleBody_body > p'):
    print(p.text)

This prints:

UPDATE 1-TSMC plans five new advanced wafer plants -paper
--------------------------------------------------------------------------------

 (Adds TSMC’s comments)  
 TAIPEI, Jan 2 (Reuters) - TSMC (2330.TW) plans to build five new advanced 12-inch wafer plants on the island in the next few years, a local newspaper said on Tuesday, after a government move to allow companies to make more advanced chips in China.  

...and so on
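As for why features='xml' drops the text: lxml's XML parser is strict where HTML parsers are forgiving, so real-world HTML (unclosed tags, HTML-only entities, case differences) gets mangled or truncated. A minimal sketch of one such difference, using a made-up snippet rather than the Reuters page: XML tag names are case-sensitive, while HTML parsers normalize them to lowercase.

from bs4 import BeautifulSoup

snippet = '<HTML><BODY><P>one</P><P>two</P></BODY></HTML>'

# The XML parser keeps tag names exactly as written,
# so a lowercase search matches nothing.
print(BeautifulSoup(snippet, features='xml').find_all('p'))   # []

# An HTML parser lowercases tag names and finds both paragraphs.
print(BeautifulSoup(snippet, 'html.parser').find_all('p'))    # [<p>one</p>, <p>two</p>]

The same strictness is the likely reason the soup in the question kept almost nothing of the page.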