Getting data from multiple tags in HTML

Time: 2018-03-26 14:33:12

Tags: java html web-scraping jsoup

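I am scraping a medical website and need to extract the information under each heading for a drug, such as precautions, contraindications, dosage, and uses. The HTML looks like the sample below. If I extract with only the selector p.drug-content, I get the content under every heading as one big paragraph. How can I get the content per heading, so that the dosage paragraph ends up under dosage, the precautions paragraph under precautions, and so on?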

<a name="Warning"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">What are the warnings and precautions for Abacavir? </h2></div>
    <p class="drug-content">
                        • Caution is advised when used in patients with history of depression or at risk for heart disease<br>•  Avoid use with alcohol.<br>•  Take along with other anti-HIV drugs and not alone, to prevent resistance.<br>•  Continue other precautions to prevent spread of HIV infection.</p></div>
<a name="Prescription"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">Why is Abacavir Prescribed? (Indications) </h2></div>
    <p class="drug-content">Abacavir is an antiviral drug that is effective against the HIV-1 virus. It acts on an enzyme of the virus called reverse transcriptase, which plays an important role in its multiplication.  Though abacavir reduces viral load and may slow the progression of the disease, it does not cure the HIV infection.&nbsp;</p></div>
<a name="Dosage"></a>
<div class="report-content drug-widget">
    <div class="drug-header"><h2 style="color:#000000!important;">What is the dosage of Abacavir?</h2></div>
    <p class="drug-content"> Treatment of HIV-1/AIDS along with other medications. Dose in adults is 600 mg daily, as a single dose or divided into two doses.
</p></div>

Here is my code:

private static void ScrapingDrugInfo() throws IOException{
            Connection.Response response = null;
            Document doc = null;
            List<SideEffectsObject> sideEffectsList = new ArrayList<>();
            int i=0;

            String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};

            for (String keyword : keywords){
                final String url = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=" + keyword;

                response = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0")
                        .execute();

                doc = response.parse();

                Element tds = doc.select("div.related-links.top-gray.col-list.clear-fix").first();

                Elements links = tds.select("li[class=list-item]");


                for (Element link : links){

                    final String newURL = "https://www.medindia.net/doctors/drug_information/".concat(link.select("a").attr("href")) ;

                    response = Jsoup.connect(newURL)
                            .userAgent("Mozilla/5.0")
                            .execute();

                    doc = response.parse();

                    Elements classification = doc.select("div.clear.b");
                    System.out.println("Classification::"+classification.text());

                    Elements drugBrands = doc.select("div.drug-content");
                    Elements drugBrandsIndian = drugBrands.select("div.links");

                    System.out.println("Drug Brand Links Indian::"+ drugBrandsIndian.select("a[href]"));

                    System.out.println("Drug Brand Names Indian::"+ drugBrandsIndian.text());

                    System.out.println("Drug Brand Names International::"+doc.select("div.drug-content.h3").text());

                    Elements prescriptionText = doc.select("a[name=Prescription]");
                    Elements prescriptionData = prescriptionText.select("p.drug-content");

                    System.out.println("Prescription Data::"+ prescriptionData.text());


                    Elements contraindications = doc.select("a[name=Contraindications]");

                    Elements contraindicationsText = contraindications.select("p[class=drug-content]");

                    System.out.println("Contraindications Text::" + contraindicationsText.text());


                    Elements dosage = doc.select("a[name=Dosage]");

                    Elements dosageText = dosage.select("p[class=drug-content]");

                    System.out.println("Dosage Text::" + dosageText.text());
                }
            }
}

1 answer:

Answer 0 (score: 0)

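If I understand the question correctly, it sounds like you want to pair the value of each a tag's name attribute with the p content of the div that follows it. You should be able to do that with the following code: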

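Elements aTags = doc.select("a[name]");

for (Element header : aTags) {
    // The <a name="..."> anchor is empty; its content lives in the div that follows it
    Element sibling = header.nextElementSibling();
    if (sibling == null) {
        continue;
    }
    // Take the first p.drug-content inside that sibling div, if there is one
    Element pTag = sibling.select("p.drug-content").first();
    if (pTag == null) {
        continue;
    }
    System.out.println(header.attr("name"));
    System.out.println(pTag.text());
}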
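
For reference, the same idea can be packaged as a small self-contained program. The selectors in the question return nothing because each `<a name="...">` anchor is an empty element: the `p.drug-content` is not inside it but inside the div that comes right after it. The sketch below is only an illustration under a few assumptions: the class name `DrugSections`, the helper `extractSections`, and the shortened sample HTML are hypothetical and not part of the original code, and it assumes jsoup is on the classpath in a version that provides `Element.selectFirst`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DrugSections {

    // Hypothetical helper: maps each anchor name (e.g. "Dosage") to the text of the
    // first p.drug-content inside the element that directly follows the anchor.
    static Map<String, String> extractSections(String html) {
        Document doc = Jsoup.parse(html);
        // LinkedHashMap keeps the sections in document order.
        Map<String, String> sections = new LinkedHashMap<>();
        for (Element anchor : doc.select("a[name]")) {
            Element sibling = anchor.nextElementSibling();
            if (sibling == null) {
                continue; // anchor is the last element in its parent
            }
            Element content = sibling.selectFirst("p.drug-content");
            if (content == null) {
                continue; // the following element has no drug content
            }
            sections.put(anchor.attr("name"), content.text());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Shortened sample modelled on the HTML in the question.
        String html =
              "<a name=\"Prescription\"></a>"
            + "<div class=\"report-content drug-widget\">"
            + "<div class=\"drug-header\"><h2>Why is Abacavir Prescribed? (Indications)</h2></div>"
            + "<p class=\"drug-content\">Abacavir is an antiviral drug.</p></div>"
            + "<a name=\"Dosage\"></a>"
            + "<div class=\"report-content drug-widget\">"
            + "<div class=\"drug-header\"><h2>What is the dosage of Abacavir?</h2></div>"
            + "<p class=\"drug-content\">Dose in adults is 600 mg daily.</p></div>";

        extractSections(html).forEach(
            (name, text) -> System.out.println(name + " -> " + text));
    }
}
```

In the scraper from the question, `response.parse()` already yields a `Document`, so the loop body could run on that document directly instead of re-parsing a string. Keying the map by the anchor name keeps the lookup independent of the wording of each h2 heading.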