Jsoup - extracting data under 'a name' class

Time: 2018-03-08 10:06:13

Tags: java web-scraping jsoup

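I am working with the following HTML. I need to extract each section name together with the content under it. I also need to check whether each of these headings and its data is present, because some pages may not have every heading. For example: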

Dosage
PO- The recommended dose is 25mg once daily for 5 days.

HTML code

<div class="mt10"></div>    
<a name="MedicalCondition"></a>
<br>
    <span style="font-size:34px">Medical Condition(s) for which Aablaquin (25mg) may be prescribed</span><div class="mt10"></div><br>
    <div class="col-sm-1 col-md-2">
<div class=""> <a href="https://www.medindia.net/drugs/medical-condition/malaria.htm"><img src="https://www.medindia.net/images/patientinfo/120x100/malaria.jpg" alt="Malaria" class="img-responsive" width="120px" height="100px"></a> </div>
<div class="caption">
<b><a href="https://www.medindia.net/drugs/medical-condition/malaria.htm">Malaria</a></b>
</div>
</div><div class="col-sm-1 col-md-2">
<div class=""> <a href="https://www.medindia.net/drugs/medical-condition/malaria.htm"><img src="https://www.medindia.net/images/patientinfo/120x100/malaria-waterborne.jpg" alt="Malaria - Waterborne" class="img-responsive" width="120px" height="100px"></a> </div>
<div class="caption">
<b><a href="https://www.medindia.net/drugs/medical-condition/malaria.htm">Malaria - Waterborne</a></b>
</div>
</div>  <div class="clear"></div>
<hr size="1" color="#333333">

 <div class="mi-container__left">
        <div class="mi-container__fluid ">



<div class="clear"></div><div class="mt20"></div>

    <!--include file = "../includes-rwd/bootstrap/widgets/share.asp"-->


<div style="clear: both;"></div>

            <a name="Sideeffects"></a>  <div class="drug-header"><h2>Side effects of Aablaquin (25mg)</h2></div>            

<p class="drug-content">No significant side effects. The drug has a good safety profile.</p>
    <div style="clear: both;"></div>
    <a name="Dosage"></a>

    <div class="drug-header"><h2>Dosage &amp; When it is to be taken (Indications)</h2></div>       
    <p class="drug-content">PO- The recommended dose is 25mg once daily for 5 days.</p>
    <div style="clear: both;"></div>
<div class="drug-header"><h2>How to use Aablaquin (25mg)?</h2></div>

    <p class="drug-content">It comes as a capsule to take by mouth, with or without food.</p>
        <div style="clear: both;"></div>

            <a name="Contraindications"></a>            

    <div class="drug-header"><h2>When is Aablaquin (25mg) not to be taken? (Contraindications)</h2></div>       

    <p class="drug-content">Contraindicated in patients with rheumatoid arthritis, systemic lupus erythematosus and co-administration of drugs known to cause haemolysis.</p>
        <div style="clear: both;"></div>

Here is my code:

   final String url = "https://www.medindia.net/drug-price/bulaquine/aablaquin-25mg.htm"; 
   Document document = Jsoup.connect(url).get();
   List<List<Node>> articles = new ArrayList<List<Node>>();
   List<Node> currentArticle = null;
   Element table = document.select("table").get(0);
   for (Element rows : table.select("tr")) {
        for (Element tds : rows.select("td")) {
            Elements links = tds.select("span");
            for (Element link : links) {
            //System.out.println("link : " + link.attr("span"));
            System.out.println("text : " + link.text());
            }
        }
   }

  Elements links = document.select("a[name]");
  System.out.println(links);
  Elements reportContent =  document.select("div[class=drug-header]") ;
  for (Element row : reportContent.select("h2")) {
          for (Element column : row.select("p")) {
              System.out.println(column);
          }
       }

   }

The output I get is not what I expected. I can extract the details from the table on the page, but not the headings and the content below it.

Output

text : Capsule
text : AHPL
text : Generic : Bulaquine
text : Unit: 25mg
text : Quantity : 10
<a name="Prescription"></a>
<a name="Overview"></a>
<a name="PriceDetails"></a>
<a name="MedicalCondition"></a>
<a name="Sideeffects"></a>
<a name="Dosage"></a>
<a name="Contraindications"></a>
<a name="misseddose"></a>
<a name="Warning"></a>
<a name="otherbrands"></a>

2 answers:

Answer 0: (score: 0)

This is because there are no <p> tags nested inside the <h2> tags, so row.select("p") matches nothing. Try this:

Elements reportContent = document.select("div[class=drug-header]");
for (Element row : reportContent.select("h2")) {
    System.out.println(row.text());
}
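
The snippet above prints the headings only. Since the question also asks how to tell whether a given section exists on a page, here is a minimal sketch that looks a section up by its anchor name. It assumes jsoup (1.11.1 or later, for selectFirst) is on the classpath and that the anchor, the header div and the paragraph are siblings, as they are in the HTML shown in the question. The class name SectionByAnchor and the method contentAfterAnchor are hypothetical names used for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Optional;

class SectionByAnchor {

    // Returns the text of the first <p class="drug-content"> that follows
    // <a name="anchorName"> among its siblings. The search stops at the next
    // named anchor, so content from a later section is never returned.
    // Assumption: anchor, header div and paragraph share the same parent.
    static Optional<String> contentAfterAnchor(Document document, String anchorName) {
        Element anchor = document.selectFirst("a[name=" + anchorName + "]");
        if (anchor == null) {
            return Optional.empty(); // this page has no such section
        }
        for (Element next = anchor.nextElementSibling();
             next != null;
             next = next.nextElementSibling()) {
            if (next.is("a[name]")) {
                break; // reached the start of the next section
            }
            if (next.is("p.drug-content")) {
                return Optional.of(next.text());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A shortened copy of the markup from the question.
        String html =
            "<a name=\"Sideeffects\"></a><div class=\"drug-header\"><h2>Side effects of Aablaquin (25mg)</h2></div>"
          + "<p class=\"drug-content\">No significant side effects. The drug has a good safety profile.</p>"
          + "<a name=\"Dosage\"></a><div class=\"drug-header\"><h2>Dosage &amp; When it is to be taken (Indications)</h2></div>"
          + "<p class=\"drug-content\">PO- The recommended dose is 25mg once daily for 5 days.</p>";
        Document document = Jsoup.parse(html);

        for (String name : new String[] {"Sideeffects", "Dosage", "Warning"}) {
            String text = contentAfterAnchor(document, name).orElse("(section not present)");
            System.out.println(name + "\n" + text + "\n");
        }
    }
}
```

Returning an Optional keeps the "section is missing" case explicit, so the caller can decide whether to skip the section or report it.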

Answer 1: (score: 0)

You should be able to first get all the drug-header divs, then get the heading text and the paragraph content as follows:

Elements
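
One way to do what this answer describes is sketched below. It is an illustration and not the answer's own code. It assumes jsoup (1.11.1 or later) is on the classpath and that each div.drug-header is immediately followed by its p.drug-content sibling, as in the HTML from the question. The class name DrugSections and the method extractSections are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

class DrugSections {

    // Maps the <h2> text of every div.drug-header to the text of the
    // <p class="drug-content"> element that directly follows the div.
    // A header with no such paragraph is mapped to an empty string.
    static Map<String, String> extractSections(Document document) {
        Map<String, String> sections = new LinkedHashMap<>(); // keeps page order
        for (Element header : document.select("div.drug-header")) {
            Element heading = header.selectFirst("h2");
            if (heading == null) {
                continue; // header div without a heading, nothing to key on
            }
            Element content = header.nextElementSibling();
            boolean hasContent = content != null && content.is("p.drug-content");
            sections.put(heading.text(), hasContent ? content.text() : "");
        }
        return sections;
    }

    public static void main(String[] args) {
        // A shortened copy of the markup from the question.
        String html =
            "<a name=\"Dosage\"></a>"
          + "<div class=\"drug-header\"><h2>Dosage &amp; When it is to be taken (Indications)</h2></div>"
          + "<p class=\"drug-content\">PO- The recommended dose is 25mg once daily for 5 days.</p>"
          + "<div style=\"clear: both;\"></div>"
          + "<div class=\"drug-header\"><h2>How to use Aablaquin (25mg)?</h2></div>"
          + "<p class=\"drug-content\">It comes as a capsule to take by mouth, with or without food.</p>";
        Document document = Jsoup.parse(html);

        extractSections(document).forEach((title, text) ->
            System.out.println(title + "\n" + text + "\n"));
    }
}
```

Because this keys on the header divs and not on the a[name] anchors, it also picks up sections such as "How to use Aablaquin (25mg)?" that have no anchor in the page. To check whether a particular heading is present, test the returned map with containsKey.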