Scraping page content from divs using BeautifulSoup

Asked: 2017-08-03 10:08:35

Tags: python web-scraping beautifulsoup

I am trying to scrape the title, summary, date, and link from each div with 'class' : 'row' on http://www.indiainfoline.com/top-news.

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
productDivs = soup.findAll('div', attrs={'class' : 'row'})
for div in productDivs:
    result = {}
    try:
        import pdb
        #pdb.set_trace()
        heading = div.find('p', attrs={'class': 'heading fs20e robo_slab mb10'}).get_text()
        title = heading.get_text()
        article_link = "http://www.indiainfoline.com"+heading.find('a')['href']
        summary = div.find('p')

But none of the components are extracted. Any suggestions on how to fix this?

2 Answers:

Answer 0: (score: 2)

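The HTML source contains many elements with class=row, so you need to narrow the search to the section that holds the actual row data. In this case, all 16 expected rows sit under id="search-list". So extract that section first, then the rows. Since .select returns a list, we use [0] to get the element. Once you have the rows, iterate over them and extract the title, article URL, summary, and so on.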

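import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
# .select returns a list, so take the first match for the id
section = soup.select('#search-list')
rowdata = section[0].select('.row')

# the first .row match is skipped; presumably it is not an article row
for row in rowdata[1:]:
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text
    article_url = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    # the first <p> in a row is the heading paragraph, so this repeats the heading
    summary = row.select('p')[0].text
    print(heading)
    print(article_url)
    print(summary)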

Output:

PFC board to consider bonus issue; stock surges by 4%     
http://www.indiainfoline.com/article/news-top-story/pfc-pfc-board-to-consider-bonus-issue-stock-surges-by-4-117080300814_1.html
PFC board to consider bonus issue; stock surges by 4%
...
...
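
The narrowing step can also be tried offline on a small inline sample. The following is a minimal sketch, not the site's real markup: the HTML string, the `extract_rows` helper, and the choice of the built-in `html.parser` are all assumptions made for illustration. It uses `select_one`, which returns a single element or `None` rather than a list.

```python
from bs4 import BeautifulSoup

BASE_URL = "http://www.indiainfoline.com"

# Hypothetical sample: one unrelated .row outside the list, two article rows inside it.
SAMPLE_HTML = """
<div class="row"><p>Navigation, not an article</p></div>
<div id="search-list">
  <ul>
    <li><div class="row">
      <p class="heading fs20e robo_slab mb10"><a href="/article/a_1.html">First headline</a></p>
      <p>First summary</p>
    </div></li>
    <li><div class="row">
      <p class="heading fs20e robo_slab mb10"><a href="/article/b_2.html">Second headline</a></p>
      <p>Second summary</p>
    </div></li>
  </ul>
</div>
"""


def extract_rows(html):
    """Return one dict per article row found under #search-list."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.select_one("#search-list")
    if section is None:
        return []
    results = []
    for row in section.select("div.row"):
        heading = row.select_one("p.heading")
        if heading is None or heading.a is None:
            continue  # not an article row
        results.append({
            "title": heading.get_text(strip=True),
            "link": BASE_URL + heading.a["href"],
        })
    return results


articles = extract_rows(SAMPLE_HTML)
for article in articles:
    print(article["title"], article["link"])
```

Checking for `None` before indexing avoids the `IndexError` that `[0]` raises when a selector matches nothing, which makes it easier to see which selector failed.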

Answer 1: (score: 1)

Try this:

from bs4 import BeautifulSoup
from urllib.request import urlopen 

link = 'http://www.indiainfoline.com/top-news'
soup = BeautifulSoup(urlopen(link),"lxml")
fixed_html = soup.prettify()

ul = soup.find('ul', attrs={'class':'row'})
print(ul.find('li'))

You will get:

<li class="animated" onclick="location.href='/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html';">
<div class="row">
<div class="col-lg-9 col-md-9 col-sm-9 col-xs-12 ">
<p class="heading fs20e robo_slab mb10"><a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a></p>
<p><!--style="color: green !important"-->
<img class="img-responsive visible-xs mob-img" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
                                            Pharma major, Lupin announced on Thursday that the company has received the United States Food and Drug Administra...
                                                                        </p>
<p class="source fs12e">India Infoline News Service |                                           
                                            Mumbai                          15:42 IST |                                          August 03, 2017                 </p>
</div>
<div class="col-lg-3 col-md-3 col-sm-3 hidden-xs pl0 listing-image">
<img class="img-responsive" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
</div>
</div>
</li>

Of course, you can print fixed_html to get the content of the whole page.
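
Given the `<li>` markup printed above, the four fields from the question (title, link, summary, date) could be pulled out roughly as follows. This is a sketch under assumptions: `LI_HTML` is a trimmed copy of that output (images, comment, and extra whitespace removed), `parse_item` is a hypothetical helper, and it assumes the `p.source` line always has the date as its last `|`-separated field.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <li> printed above (assumption: images and padding removed).
LI_HTML = """
<li class="animated">
<div class="row">
<div class="col-lg-9 col-md-9 col-sm-9 col-xs-12 ">
<p class="heading fs20e robo_slab mb10"><a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a></p>
<p>
    Pharma major, Lupin announced on Thursday that the company has received the United States Food and Drug Administra...
</p>
<p class="source fs12e">India Infoline News Service |
    Mumbai    15:42 IST |    August 03, 2017    </p>
</div>
</div>
</li>
"""


def parse_item(li):
    """Pull title, link, summary and date out of one <li> element (hypothetical helper)."""
    anchor = li.select_one("p.heading a")
    source = li.select_one("p.source")
    # In the markup above, the summary is the only <p> without a class attribute.
    plain_paragraphs = [p for p in li.find_all("p") if not p.has_attr("class")]
    # The source line reads "publisher | city time | date"; take the last field.
    date = source.get_text().split("|")[-1].strip()
    return {
        "title": anchor.get_text(strip=True),
        "link": "http://www.indiainfoline.com" + anchor["href"],
        "summary": plain_paragraphs[0].get_text(strip=True),
        "date": date,
    }


item = parse_item(BeautifulSoup(LI_HTML, "html.parser").li)
print(item["title"])
print(item["date"])
```

On the live page, the same helper could be applied to each element returned by `ul.find_all('li')`, provided the markup still matches what is shown above.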