How to extract the text inside a div tag from HTML in Java

Asked: 2012-07-26 04:48:53

Tags: java java-me html-parsing

Hello,

I want to extract the text between the div tags:
<div class="innercontenttxt"> 
<p><img border="1" align="left" height="170" width="324" vspace="3" hspace="2" src="/tmdbuserfiles/ramdev-balakrishna(1).jpg" alt="ramdev aide remanded, lakrishna acharya judicial remand, ramdev aide fake passport case, baba ramdev assistant judicial custody, balakrishna sent to judicial custody, yoga guru ramdev assistant remanded, yoga guru ramdev assistant balakrishna" />
Yoga guru Ramdev's aide Balakrishna Acharya remanded to 14 days judicial custody in a fake passport on Saturday. He was arrested yesterday after he failed to appear at a Dehradun court.
    <br />
    <br />
     Balakrishna Acharya, who is basically a Nepalese citizen, 
     is alleged to have submitted fake documents to procure a passport. 
     When he failed to appear in Dehradun court in connection with the case,
</p>  
</div>

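After extraction, the result should be:

  Ramdev aide Balakrishna Acharya remanded to 14 days judicial custody in a fake passport on Saturday. He was arrested yesterday after he failed to appear at a Dehradun court. Balakrishna Acharya, who is basically a Nepalese citizen, is alleged to have submitted fake documents to procure a passport. When he failed to appear in Dehradun court in connection with the case, a non-bailable arrest warrant was issued and he was arrested yesterday.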

2 answers:

Answer 0: (score: 1)

You may want to try one of the Java HTML parser libraries:

HTML Parser - http://htmlparser.sourceforge.net

jsoup - http://jsoup.org/

Answer 1: (score: 1)

This question is similar to this other question.

Assume you have already stored the HTML source in a String variable named htmlPage.

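// Find the start of the first <div ...> opening tag, then the '>' that closes it.
int divIndex = htmlPage.indexOf("<div");
divIndex = htmlPage.indexOf(">", divIndex);

// Take everything up to the next </div>; inner tags such as <p> and <br /> are still included.
int endDivIndex = htmlPage.indexOf("</div>", divIndex);
String content = htmlPage.substring(divIndex + 1, endDivIndex);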
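
Note that the substring approach returns the markup inside the div, so the <p>, <img> and <br /> tags from the question would still be in the result. The following is a minimal sketch of one way to also drop the tags and collapse the whitespace, using only String and StringBuffer so that it stays close to what Java ME offers. The names DivTextExtractor, extractDiv and stripTags are hypothetical, and the sketch assumes the page contains no nested divs and no '>' characters inside attribute values. For real-world HTML a parser library is the safer choice.

```java
// Hypothetical helper class; a sketch under the assumptions stated above.
class DivTextExtractor {

    // Returns the markup between the first <div ...> opening tag and the
    // next </div>, or null if any of the three markers is missing.
    static String extractDiv(String htmlPage) {
        int open = htmlPage.indexOf("<div");
        if (open < 0) return null;
        int openEnd = htmlPage.indexOf(">", open);
        if (openEnd < 0) return null;
        int close = htmlPage.indexOf("</div>", openEnd);
        if (close < 0) return null;
        return htmlPage.substring(openEnd + 1, close);
    }

    // Drops everything between '<' and '>' and collapses runs of whitespace
    // into a single space. Each tag is treated as a word break, so
    // "a<br />b" becomes "a b".
    static String stripTags(String markup) {
        StringBuffer out = new StringBuffer();
        boolean inTag = false;
        boolean pendingSpace = false;
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                inTag = true;
                pendingSpace = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
                    pendingSpace = true;
                } else {
                    // Emit one separating space, but never at the very start.
                    if (pendingSpace && out.length() > 0) out.append(' ');
                    pendingSpace = false;
                    out.append(c);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String htmlPage = "<div class=\"innercontenttxt\"><p>Some<br />\n  text</p></div>";
        String inner = extractDiv(htmlPage);
        System.out.println(inner == null ? "no div found" : stripTags(inner));
    }
}
```

Character entities such as &amp; are left untouched by this sketch, so they would need a separate decoding step if the page uses them.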