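I am working with the following HTML data. I need to extract each heading name along with the content beneath it. I also need to check whether all of these headings and their data are present, because some pages may not have every heading. For example: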
Dosage
PO- The recommended dose is 25mg once daily for 5 days.
HTML code
<div class="mt10"></div>
<a name="MedicalCondition"></a>
<br>
<span style="font-size:34px">Medical Condition(s) for which Aablaquin (25mg) may be prescribed</span><div class="mt10"></div><br>
<div class="col-sm-1 col-md-2">
<div class=""> <a href="https://www.medindia.net/drugs/medical-condition/malaria.htm"><img src="https://www.medindia.net/images/patientinfo/120x100/malaria.jpg" alt="Malaria" class="img-responsive" width="120px" height="100px"></a> </div>
<div class="caption">
<b><a href="https://www.medindia.net/drugs/medical-condition/malaria.htm">Malaria</a></b>
</div>
</div><div class="col-sm-1 col-md-2">
<div class=""> <a href="https://www.medindia.net/drugs/medical-condition/malaria.htm"><img src="https://www.medindia.net/images/patientinfo/120x100/malaria-waterborne.jpg" alt="Malaria - Waterborne" class="img-responsive" width="120px" height="100px"></a> </div>
<div class="caption">
<b><a href="https://www.medindia.net/drugs/medical-condition/malaria.htm">Malaria - Waterborne</a></b>
</div>
</div> <div class="clear"></div>
<hr size="1" color="#333333">
<div class="mi-container__left">
<div class="mi-container__fluid ">
<div class="clear"></div><div class="mt20"></div>
<!--include file = "../includes-rwd/bootstrap/widgets/share.asp"-->
<div style="clear: both;"></div>
<a name="Sideeffects"></a> <div class="drug-header"><h2>Side effects of Aablaquin (25mg)</h2></div>
<p class="drug-content">No significant side effects. The drug has a good safety profile.</p>
<div style="clear: both;"></div>
<a name="Dosage"></a>
<div class="drug-header"><h2>Dosage & When it is to be taken (Indications)</h2></div>
<p class="drug-content">PO- The recommended dose is 25mg once daily for 5 days.</p>
<div style="clear: both;"></div>
<div class="drug-header"><h2>How to use Aablaquin (25mg)?</h2></div>
<p class="drug-content">It comes as a capsule to take by mouth, with or without food.</p>
<div style="clear: both;"></div>
<a name="Contraindications"></a>
<div class="drug-header"><h2>When is Aablaquin (25mg) not to be taken? (Contraindications)</h2></div>
<p class="drug-content">Contraindicated in patients with rheumatoid arthritis, systemic lupus erythematosus and co-administration of drugs known to cause haemolysis.</p>
<div style="clear: both;"></div>
Here is my code:
final String url = "https://www.medindia.net/drug-price/bulaquine/aablaquin-25mg.htm";
Document document = Jsoup.connect(url).get();
List<List<Node>> articles = new ArrayList<List<Node>>();
List<Node> currentArticle = null;
Element table = document.select("table").get(0);
for (Element rows : table.select("tr")) {
    for (Element tds : rows.select("td")) {
        Elements links = tds.select("span");
        for (Element link : links) {
            //System.out.println("link : " + link.attr("span"));
            System.out.println("text : " + link.text());
        }
    }
}
Elements links = document.select("a[name]");
System.out.println(links);
Elements reportContent = document.select("div[class=drug-header]");
for (Element row : reportContent.select("h2")) {
    for (Element column : row.select("p")) {
        System.out.println(column);
    }
}
}
The output I get is not what I expected. I can extract the details from the table on the page, but not the headings and the content beneath them.
Output
text : Capsule
text : AHPL
text : Generic : Bulaquine
text : Unit: 25mg
text : Quantity : 10
<a name="Prescription"></a>
<a name="Overview"></a>
<a name="PriceDetails"></a>
<a name="MedicalCondition"></a>
<a name="Sideeffects"></a>
<a name="Dosage"></a>
<a name="Contraindications"></a>
<a name="misseddose"></a>
<a name="Warning"></a>
<a name="otherbrands"></a>
Answer 0 (score: 0)
This is because there are no <p> tags nested inside the <h2> tags.
Try this:
Elements reportContent = document.select("div[class=drug-header]");
for (Element row : reportContent.select("h2")) {
    System.out.println(row.text());
}
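To illustrate why the inner loop in the question prints nothing, here is a small sketch run against an inline copy of the Dosage markup rather than the live page. In jsoup, select() only searches descendants, while the paragraph is a sibling of the header div. The class name HeaderSiblingDemo and the helper contentAfter are hypothetical names used only for this example, and it assumes the p.drug-content always comes directly after its div.drug-header, as in the HTML shown in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HeaderSiblingDemo {

    // Hypothetical helper: returns the text of the p.drug-content that
    // directly follows the given header div, or null if there is none.
    static String contentAfter(Element header) {
        Element next = header.nextElementSibling();
        return (next != null && next.is("p.drug-content")) ? next.text() : null;
    }

    public static void main(String[] args) {
        // Inline copy of the markup from the question (assumed structure).
        String html = "<div class=\"drug-header\"><h2>Dosage</h2></div>"
                + "<p class=\"drug-content\">PO- The recommended dose is 25mg once daily for 5 days.</p>";
        Document doc = Jsoup.parse(html);
        Element header = doc.selectFirst("div.drug-header");

        // select() looks only at descendants of the header, so no <p> is found.
        System.out.println(header.select("p").size()); // prints 0

        // The paragraph is the next sibling element of the header div.
        System.out.println(contentAfter(header));
    }
}
```

If a page places other elements between the header and its paragraph, the sibling lookup would need to skip over them; that case is not handled here.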
Answer 1 (score: 0)
You should be able to get all of the drug-header divs first, and then get the heading text and paragraph content based on them:
Elements
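A minimal sketch of this approach, which also covers the question's need to check whether a given heading is present. It is not the answerer's original code. DrugSections, extract and find are hypothetical names introduced for illustration, and the sketch assumes that each p.drug-content directly follows its div.drug-header, as in the HTML from the question. It parses an inline sample instead of fetching the live page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DrugSections {

    // Maps each heading text to the content of the paragraph that follows it.
    // A heading with no following p.drug-content is stored with an empty string.
    static Map<String, String> extract(Document doc) {
        Map<String, String> sections = new LinkedHashMap<>();
        for (Element header : doc.select("div.drug-header")) {
            Element h2 = header.selectFirst("h2");
            if (h2 == null) {
                continue; // header div without a heading: nothing to record
            }
            Element body = header.nextElementSibling();
            boolean hasBody = body != null && body.hasClass("drug-content");
            sections.put(h2.text(), hasBody ? body.text() : "");
        }
        return sections;
    }

    // Returns the content of the first section whose heading contains the
    // keyword (case-insensitive), or null if no such heading exists.
    static String find(Map<String, String> sections, String keyword) {
        String needle = keyword.toLowerCase();
        for (Map.Entry<String, String> e : sections.entrySet()) {
            if (e.getKey().toLowerCase().contains(needle)) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Inline sample modelled on the question's HTML (assumed structure).
        String html = "<div class=\"drug-header\"><h2>Side effects of Aablaquin (25mg)</h2></div>"
                + "<p class=\"drug-content\">No significant side effects. The drug has a good safety profile.</p>"
                + "<div class=\"drug-header\"><h2>Dosage &amp; When it is to be taken (Indications)</h2></div>"
                + "<p class=\"drug-content\">PO- The recommended dose is 25mg once daily for 5 days.</p>";
        Map<String, String> sections = extract(Jsoup.parse(html));

        for (Map.Entry<String, String> e : sections.entrySet()) {
            System.out.println(e.getKey() + "\n" + e.getValue() + "\n");
        }

        // Presence check for headings that some pages may lack.
        for (String wanted : new String[] {"Dosage", "Contraindications"}) {
            String content = find(sections, wanted);
            System.out.println(wanted + ": " + (content == null ? "(not on this page)" : content));
        }
    }
}
```

Matching by keyword rather than by exact heading text is a design choice for this sketch, since the headings on the page embed the drug name. For the real page, replacing Jsoup.parse(html) with Jsoup.connect(url).get() should give the same kind of Document.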