Extracting data from <br> tags - Jsoup

Time: 2018-03-13 13:53:15

Tags: java html dom web-scraping jsoup

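I am trying to extract the text that follows the <br> tags in the HTML below and print it as bullet points. My code prints it as one continuous string with some unnecessary text mixed in. I saw that I could do a replace("<br>","\n"), but that does not help in this case, because the Google ad HTML comes along with the text.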

<div class="report-content" style="line-height:160%!important"> Read the side effects of Abacavir as described in the medical literature. In case of any doubt consult your doctor or pharmacist. 
<!-- Ezoic - under-first-paragraph - under_first_paragraph -->
<div id="ezoic-pub-ad-placeholder-101">
          <div id="google-ads-1" class="fleft"></div>
            <script type="text/javascript">
            google_ad_client = "ca-pub-4864473589052117";
            google_ad_slot = "6404003758";
            google_ad_height = 250;
            ad1 = document.getElementById('google-ads-1');
                if (ad1.getBoundingClientRect().width) {
    google_ad_width = ad1.getBoundingClientRect().width;
    } else {
    google_ad_width = ad1.offsetWidth; // for old IE
    }
    google_ad_width=rwdscreenWidth;

        /*Full Width Ad*/
    if (google_ad_width>1024) {
    google_ad_width = 880;
    google_ad_height = 300;
    } 
    else if ((google_ad_width<1025) && (google_ad_width>959)) {
    google_ad_width = 605;
    google_ad_height = 300;
    }
    else if ((google_ad_width<960) && (google_ad_width>799)) {
    google_ad_width = 730;
    google_ad_height = 300;
    }
    else if ((google_ad_width<800) && (google_ad_width>767)) {
    google_ad_width = 600;
    google_ad_height = 300;
    }
    else if ((google_ad_width<768) && (google_ad_width>599)) {
    google_ad_width = 540;
    google_ad_height = 300;
    }
    else if ((google_ad_width<600) && (google_ad_width>479)) {
    google_ad_width = 420;
    google_ad_height = 250;
    }   
    else if ((google_ad_width<480) && (google_ad_width>300)) {
    google_ad_width = 300;
    google_ad_height = 250;
    }       
    else {
    google_ad_width = 300;
    google_ad_height = 250;
    }


    document.write (
    '<ins class="adsbygoogle" style="display:inline-block;width:'
    + google_ad_width + 'px;height:'
    + google_ad_height + 'px" data-ad-client="'
    + google_ad_client + '" data-ad-slot="'
    + google_ad_slot + '"></ins>'
    );
    (adsbygoogle = window.adsbygoogle || []).push({});
    </script><ins class="adsbygoogle" style="display:inline-block;width:600px;height:300px" data-ad-client="ca-pub-4864473589052117" data-ad-slot="6404003758" data-adsbygoogle-status="done"><ins id="aswift_1_expand" style="display:inline-table;border:none;height:300px;margin:0;padding:0;position:relative;visibility:visible;width:600px;background-color:transparent;"><ins id="aswift_1_anchor" style="display:block;border:none;height:300px;margin:0;padding:0;position:relative;visibility:visible;width:600px;background-color:transparent;"><iframe width="600" height="300" frameborder="0" marginwidth="0" marginheight="0" vspace="0" hspace="0" allowtransparency="true" scrolling="no" allowfullscreen="true" onload="var i=this.id,s=window.google_iframe_oncopy,H=s&amp;&amp;s.handlers,h=H&amp;&amp;H[i],w=this.contentWindow,d;try{d=w.document}catch(e){}if(h&amp;&amp;d&amp;&amp;(!d.body||!d.body.firstChild)){if(h.call){setTimeout(h,0)}else if(h.match){try{h=s.upd(h,i)}catch(e){}w.location.replace(h)}}" id="aswift_1" name="aswift_1" style="left:0;position:absolute;top:0;width:600px;height:300px;"></iframe></ins></ins></ins>

    <script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>

</div><!-- End Ezoic - under-first-paragraph - under_first_paragraph -->
<br><br> Sleep disturbances, headache, depressive disorders
<br><br> Digestive tract disorders like nausea, diarrhea
<br><br> Allergic reaction, which may be mild to severe
<br><br> Liver disease, which may cause nausea, jaundice, dark-colored urine, clay-colored stools
<br><br> Reaction to infections in the body due to improvement in the immune status
<br><br> Redistribution of fat resulting in thin limbs, fat abdomen and hump in upper back


<br>



                    <div class="pad10"></div><b>Other Precautions :&nbsp;</b>•  Monitor and treat the signs of lactic acidosis such as upset stomach, fluctuations in heartbeat,unexplained muscle pain, and difficulty in breathing.<br>

•  Patient's body fat and cardiac parameters should be measured regularly to avoid heart related illness.<br>
</div>

My code

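import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

// The class name is a placeholder; the original snippet omitted the class declaration.
public class Main {

    public static void main(String[] args) throws IOException {

        final String url = "https://www.medindia.net/drugs/medication-side-effects/abacavir.htm";

        Connection.Response response = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .execute();

        Document doc = response.parse();

        // text() joins all descendant text into one whitespace-normalized string,
        // so the line breaks marked by <br> are lost here.
        String text = doc.select("div.report-content").first().text();
        // The return value is discarded, so this call does not change what is printed.
        Jsoup.clean(text, Whitelist.basic());

        System.out.println(text);
    }
}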

My output

  

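Read the side effects of Abacavir as described in the medical literature. In case of any doubt consult your doctor or pharmacist. Sleep disturbances, headache, depressive disorders Digestive tract disorders like nausea, diarrhea Allergic reaction, which may be mild to severe Liver disease, which may cause nausea, jaundice, dark-colored urine, clay-colored stools Reaction to infections in the body due to improvement in the immune status Redistribution of fat resulting in thin limbs, fat abdomen and hump in upper back Other Precautions : ? Monitor and treat the signs of lactic acidosis such as upset stomach, fluctuations in heartbeat,unexplained muscle pain, and difficulty in breathing. ? Patient's body fat and cardiac parameters should be measured regularly to avoid heart related illness.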

Expected output

 Sleep disturbances, headache, depressive disorders
 Digestive tract disorders like nausea, diarrhea
 Allergic reaction, which may be mild to severe
 Liver disease, which may cause nausea, jaundice, dark-colored urine, clay-colored stools
 Reaction to infections in the body due to improvement in the immune status

1 answer:

Answer 0 (score: 1)

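<br> is what is known as an empty element, which means it cannot contain data. Because it is an empty element, <br> and <br/> behave the same way: they close immediately. They do not (and cannot) contain data. The text you want is not inside the <br> tags at all; it belongs to the enclosing <div> with the class report-content.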
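Since the text belongs to the enclosing <div> rather than to the <br> tags, one possible approach is to walk the direct text nodes of that <div>. The following is only a minimal sketch, not part of the original answer: the class name BrTextExtractor and the helper directTextLines are hypothetical names chosen for illustration, and the sample HTML in main is a shortened stand-in for the real page. It relies on jsoup's Element.textNodes(), which returns only the text nodes that are direct children of the element, so text nested inside the ad <div> or the <b> tag is not picked up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class BrTextExtractor {

    // Returns the non-blank direct text nodes of the first element matching
    // cssQuery, one entry per text run between child elements such as <br>.
    static List<String> directTextLines(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element container = doc.select(cssQuery).first();
        List<String> lines = new ArrayList<>();
        if (container == null) {
            return lines; // nothing matched the selector
        }
        for (TextNode node : container.textNodes()) {
            String line = node.text().trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shortened sample markup (an assumption standing in for the real page).
        String html = "<div class=\"report-content\"> Intro sentence. "
                + "<div id=\"ad\"><script>var x = 1;</script></div>"
                + "<br><br> Sleep disturbances, headache"
                + "<br><br> Digestive tract disorders<br></div>";
        for (String line : directTextLines(html, "div.report-content")) {
            System.out.println(line);
        }
    }
}
```

On the real page the introductory sentence and the "Other Precautions" bullets are also direct text of the <div>, so they would come through as well; dropping them (for example, by skipping the first entry) is left to the caller.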