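How do I get only the news section from the HTML?

Time: 2019-02-19 09:26:57

Tags: html beautifulsoup

I am new to programming, and I only need the news article itself. Is there a simple way to strip the unnecessary HTML from the text? I have to go on to iterate over many links and then run sentiment analysis on them.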

import requests
from bs4 import BeautifulSoup

p = 'https://www.moneycontrol.com/news/business/earnings/cadila-health-consolidated-december-2018-net-sales-at-rs-3577-90-crore-up-9-77-y-o-y-3497711.html'
html = requests.get(p)
soup1 = BeautifulSoup(html.text, 'html.parser')
date = soup1.find_all("div", {"class": "arttidate"})
print(date)
# The article body lives in a <div class="arti-flow">, not a <script>
article = soup1.find_all("div", {"class": "arti-flow"})
print(article)

The output is as follows:

[<div class="arttidate ">Last Updated: Feb 07, 2019 03:05 PM IST | Source: <span>Moneycontrol.com</span></div>]
[<div class="arti-flow" id="article-main">
<!-- .CONTENT BODY -->
<p><div class="top_dis" id="div_app_container"><b>Reported Consolidated quarterly numbers for Cadila Healthcare are:</b></div></p>
<p>Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.</p>
<p>Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. 543.30 crore in December 2017.</p>
<div class="ads-320-250 show-moblie mid-arti-ad"><div id="Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2")
        });
    }
</script>
</div></div>
<div class="hide-moblie mid-arti-ad"><div id="Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream")
        });
    }
</script>
</div></div>
<script>
    date = new Date();
    date.setTime(date.getTime() + (1 * 24 * 60 * 60 * 1000));
    $.cookie("dfp_cookie_article", "Y1", {
        expires: date,
        path: "/",
        domain: ".moneycontrol.com"
    });
</script>
<p>EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. 882.30 crore in December 2017.</p>
<div class="hide-moblie mid-arti-ad"><div id="Moneycontrol/MC_News/MC_News_Internal_Article_Native">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol/MC_News/MC_News_Internal_Article_Native";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_Article_Native")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_Article_Native")
        });
    }
</script>
</div></div>
<div class="hide-moblie mid-arti-ad"><div id="Moneycontrol/MC_News/MC_News_Internal_OutStream">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol/MC_News/MC_News_Internal_OutStream";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_OutStream")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_OutStream")
        });
    }
</script>
</div></div>
<script>
    date = new Date();
    date.setTime(date.getTime() + (1 * 24 * 60 * 60 * 1000));
    $.cookie("dfp_cookie_article", "Y1", {
        expires: date,
        path: "/",
        domain: ".moneycontrol.com"
    });
</script>
<p>Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 in December 2017.</p>
<p>Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has given -16.39% returns over the last 6 months and -21.40% over the last 12 months.</p>
</div>]

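The result I actually expect is:

Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.

Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. 543.30 crore in December 2017. EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. 882.30 crore in December 2017. Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 in December 2017.

Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has given -16.39% returns over the last 6 months and -21.40% over the last 12 months.

Edit: While writing up this output, I realized that all the news text I want is inside "p" tags, so I will have to pull the news article into a separate object and read only its "p" tags. Can someone guide me on how to do that?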

2 answers:

Answer 0 (score: 3)

I think you only want the text inside the various <p> tags.

To do that, you can find all the <p> tags and call get_text() on each of them.

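import requests
from bs4 import BeautifulSoup

p = 'https://www.moneycontrol.com/news/business/earnings/cadila-health-consolidated-december-2018-net-sales-at-rs-3577-90-crore-up-9-77-y-o-y-3497711.html'

html = requests.get(p)
soup1 = BeautifulSoup(html.text, 'html.parser')

# Collect every <p> tag on the page (this includes footer paragraphs)
para = soup1.find_all('p')

# Extract the text of each paragraph; the loop variable is named "tag"
# so that it does not overwrite the URL stored in "p"
result = []
for tag in para:
    result.append(tag.get_text())

print(result)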

The output will be:

['Reported Consolidated quarterly numbers for Cadila Healthcare are:',
 'Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 '
 'crore in December 2017.',
 'Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. '
 '543.30 crore in December 2017.',
 'EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. '
 '882.30 crore in December 2017.',
 'Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 '
 'in December 2017.',
 'Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has '
 'given -16.39% returns over the last 6 months and -21.40% over the last 12 '
 'months.',
 'Podcast | NSE Invest O Cast episode 5: Harsh Roongta on the benefits of SIP',
 ' Copyright © e-Eighteen.com Ltd. All rights reserved. Reproduction of news '
 'articles, photos, videos or any other content in whole or in part in any '
 'form \r\n'
 '        or medium without express writtern permission of moneycontrol.com is '
 'prohibited.',
 '\n'
 ' Copyright © e-Eighteen.com Ltd All rights resderved. Reproduction of news '
 'articles, photos, videos or any other content in whole or in part in any '
 'form or medium without express writtern permission of moneycontrol.com is '
 'prohibited.\r\n'
 '\t\t']

You can then skip some of these entries or filter them with a regular expression.
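One way to avoid the footer entries in the first place is to search for <p> tags only inside the article container rather than the whole page. The following is a minimal sketch under a few assumptions: it relies on the "arti-flow" class seen in the question's output, it runs on a trimmed-down sample string (SAMPLE_HTML, made up here so it works offline) instead of the live page, and article_paragraphs is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the real page so the sketch runs offline (assumption).
SAMPLE_HTML = """
<div class="arti-flow" id="article-main">
  <p>Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.</p>
  <div class="hide-moblie mid-arti-ad"><script>var width = window.innerWidth;</script></div>
  <p>EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. 882.30 crore in December 2017.</p>
</div>
<p> Copyright e-Eighteen.com Ltd. All rights reserved. </p>
"""

def article_paragraphs(html):
    """Return the cleaned text of each <p> inside the article container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "arti-flow"})
    if container is None:
        return []  # page layout differs from what this sketch assumes
    paragraphs = []
    for tag in container.find_all("p"):
        # Collapse runs of whitespace (newlines, tabs) into single spaces
        text = re.sub(r"\s+", " ", tag.get_text()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs

paragraphs = article_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

With the real page you would pass html.text instead of SAMPLE_HTML. If the site changes its class names, the function returns an empty list rather than raising, which may be easier to handle when looping over many links.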

Answer 1 (score: 2)

You can also get the article from the JSON embedded in a <script> tag.

import requests
import bs4
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

p = 'https://www.moneycontrol.com/news/business/earnings/cadila-health-consolidated-december-2018-net-sales-at-rs-3577-90-crore-up-9-77-y-o-y-3497711.html'
html = requests.get(p, headers=headers)
soup1 = bs4.BeautifulSoup(html.text, 'html.parser')
date = soup1.find_all("div", {"class": "arttidate"})
print(date)

# The page embeds its structured data as JSON-LD inside <script> tags
scripts = soup1.find_all("script", {'type': 'application/ld+json'})

# Stays None if no script contains an articleBody
article = None

for script in scripts:
    if "articleBody" in script.text:
        jsonStr = script.text.strip()
        # strict=False allows raw control characters (such as newlines) inside strings
        jsonObj = json.loads(jsonStr, strict=False)

        # The payload is a list; the article is its first element
        article = jsonObj[0]['articleBody']

print(article)

Output:

Reported Consolidated quarterly numbers for Cadila Healthcare are:

Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.

Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. 543.30 crore in December 2017.

EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. 882.30 crore in December 2017.

Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 in December 2017.

Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has given -16.39% returns over the last 6 months and -21.40% over the last 12 months.









Cadila Healthcare


Consolidated Quarterly Results
in Rs. Cr.











Dec'18
Sep'18
Dec'17


Net Sales/Income from operations
3,516.10
2,844.10
3,191.80


Other Operating Income
61.80
117.10
67.80


Total Income From Operations
3,577.90
2,961.20
3,259.60


EXPENDITURE


Consumption of Raw Materials
590.50
658.30
661.00


Purchase of Traded Goods
620.50
465.10
495.90


Increase/Decrease in Stocks
141.20
-131.50
-32.30


Power & Fuel
--
--
--


Employees Cost
524.00
521.20
460.80


Depreciation
153.70
147.50
147.30


Excise Duty
--
--
--


Admin. And Selling Expenses
--
--
--


R & D Expenses
--
--
--


Provisions And Contingencies
--
--
--


Exp. Capitalised
--
--
--


Other Expenses
861.80
760.30
833.00


P/L Before Other Inc., Int., Excpt. Items & Tax
686.20
540.30
693.90


Other Income
31.00
30.40
41.10


P/L Before Int., Excpt. Items & Tax
717.20
570.70
735.00


Interest
45.50
35.70
13.50


P/L Before Exceptional Items & Tax
671.70
535.00
721.50


Exceptional Items
--
--
--


P/L Before Tax
671.70
535.00
721.50


Tax
158.60
124.70
178.60


P/L After Tax from Ordinary Activities
513.10
410.30
542.90


Prior Year Adjustments
--
--
--


Extra Ordinary Items
--
--
--


Net Profit/(Loss) For the Period
513.10
410.30
542.90


Minority Interest
-10.90
-10.70
-10.10


Share Of P/L Of Associates
8.50
17.90
10.50


Net P/L After M.I & Associates
510.70
417.50
543.30


Equity Share Capital
102.40
102.40
102.40


Reserves Excluding Revaluation Reserves
--
--
--


Equity Dividend Rate (%)
--
--
--


EPS Before Extra Ordinary


Basic EPS
4.99
4.08
5.31


Diluted EPS
4.99
4.08
5.31


EPS After Extra Ordinary


Basic EPS
4.99
4.08
5.31


Diluted EPS
4.99
4.08
5.31


Public Share Holding


No Of Shares (Crores)
--
--
--


Share Holding (%)
--
--
--


Promoters and Promoter Group Shareholding


a) Pledged/Encumbered


- Number of shares (Crores)
--
--
--


- Per. of shares (as a % of the total sh. of prom. and promoter group)
--
--
--


- Per. of shares (as a % of the total Share Cap. of the company)
--
--
--


b) Non-encumbered


- Number of shares (Crores)
--
--
--


- Per. of shares (as a % of the total sh. of prom. and promoter group)
--
--
--


- Per. of shares (as a % of the total Share Cap. of the company)
--
--
--


Source :  Dion Global Solutions Limited
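
As the output shows, articleBody contains the prose followed by the flattened results table. If you only need the prose for sentiment analysis, one option is to cut the text at the first long run of blank lines. The sketch below is only a heuristic under stated assumptions: the function names extract_article_body and prose_only are hypothetical, the "three or more blank lines" threshold is a guess based on the output above, and the sample payload is made up so the code runs without the network. It also accepts JSON-LD that is either a single object or a list of objects, since other pages may not wrap the article in a list.

```python
import json
import re

def extract_article_body(json_str):
    """Return the articleBody string from a JSON-LD payload, or None."""
    data = json.loads(json_str, strict=False)
    # JSON-LD may be a single object or a list of objects
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict) and "articleBody" in item:
            return item["articleBody"]
    return None

def prose_only(body):
    """Keep the lines before the first run of three or more blank lines."""
    head = re.split(r"\n(?:[ \t]*\n){3,}", body, maxsplit=1)[0]
    return [line.strip() for line in head.splitlines() if line.strip()]

# Made-up payload imitating the shape of the real articleBody (assumption)
sample = json.dumps([{
    "@type": "NewsArticle",
    "articleBody": "First paragraph.\n\nSecond paragraph.\n\n\n\n\n\n"
                   "Cadila Healthcare\n\n\nConsolidated Quarterly Results",
}])

body = extract_article_body(sample)
print(prose_only(body))
```

In the script above you would call extract_article_body(script.text.strip()) inside the loop and pass the result to prose_only. If a page separates its paragraphs differently, the threshold in the regular expression would need adjusting.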