Extracting HTML data using Jsoup

Time: 2016-07-29 16:05:40

Tags: java html jsoup informatica

I have a table with columns such as ID, TEXT, and so on. TEXT is a CLOB column that contains data in HTML format.

Sample data:

<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm<o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<SPAN style="mso-spacerun: yes">  </SPAN>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<SPAN style="mso-spacerun: yes">  </SPAN>The following items represent the scope and visit focus areas:<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program<o:p></o:p></SPAN></P>

I am using a Java transformation in Informatica with the Jsoup jar file imported. When I call Jsoup.parse(AUDIT_SCOPE_LOB).toString(); I get the following data:

<html>
 <head></head>
 <body>
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am
    <!--?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /-->
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<span style="mso-spacerun: yes"> </span>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<span style="mso-spacerun: yes"> </span>The following items represent the scope and visit focus areas:
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program
    <o:p></o:p></span></p> 
 </body>
</html>

When I call Jsoup.parse(AUDIT_SCOPE_LOB).text(); I get the following data:

Start: 8:30 am End: 4 pm The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas: 1. SOP Program 2. Training Program 3. Calibration/Preventive Maintenance Program

I don't know much about Java. Could I get Java code that uses jsoup to extract the data and return the output as shown below?

Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program

This is only sample data. My real data also contains HTML tags that are not shown here.

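2 answers:

Answer 0: (score: 1)

Since the information is split across <p> tags, you have to select all of them and print their text one by one, assuming AUDIT_SCOPE_LOB is a valid Java String.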

// Parse the HTML, then print the text of each <p> element on its own line.
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);
Elements el = doc.select("p");
for (Element e : el) {
    System.out.println(e.text());
}

Answer 1: (score: 1)

org.jsoup.nodes.Element.toString() returns org.jsoup.nodes.Element.outerHtml():

  Get the outer HTML of this node.

org.jsoup.nodes.Element.text():

  Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.


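So calling toString() on the whole document gives back the full markup, which is the first output you posted. Likewise, calling text() returns all of the text, without tags, as a single String. What you want, however, is a separate string for each paragraph of text.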
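To make the difference concrete, here is a minimal sketch that calls text() once on the whole document and once on a single <p> element. The class name, method names, and the trimmed-down SAMPLE string are made up for illustration and are not taken from the question's real data.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsParagraphs {
    // Trimmed-down stand-in for the CLOB content (an assumption, not the real data).
    static final String SAMPLE =
        "<P class=00Normal><SPAN>Start: 8:30 am</SPAN></P>"
      + "<P class=00Normal><SPAN>End: 4 pm</SPAN></P>";

    // text() on the whole document: all paragraphs merged into one string.
    static String wholeText(String html) {
        return Jsoup.parse(html).text();
    }

    // text() on a single <p> element: only that paragraph's text.
    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("p").first();
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        System.out.println(wholeText(SAMPLE));      // Start: 8:30 am End: 4 pm
        System.out.println(firstParagraph(SAMPLE)); // Start: 8:30 am
    }
}
```

Calling text() per element rather than per document is what keeps the paragraph boundaries.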

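Some of your paragraph tags are empty. To get the output shown in your example, you should first check that each paragraph actually has text.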

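// Jsoup.parse(String) is enough for an in-memory String; the two-argument
// overload takes a base URI as its second argument, not a charset.
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);

// Print only the paragraphs that contain visible text.
for (Element p : doc.select("p")) {
    if (p.hasText()) {
        System.out.println(p.text());
    }
}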

Output:

Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program
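Inside an Informatica Java transformation you will probably want the result as one String rather than console output. The following is a rough sketch of that idea, not a tested Informatica snippet: the class name ParagraphLines and the method toLines are hypothetical, and it deliberately sticks to StringBuilder so it does not depend on a recent Java version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ParagraphLines {
    // Returns the text of every non-empty <p>, one paragraph per line.
    static String toLines(String html) {
        if (html == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (Element p : Jsoup.parse(html).select("p")) {
            if (p.hasText()) {               // skip paragraphs without visible text
                if (out.length() > 0) {
                    out.append('\n');        // newline between paragraphs, none at the end
                }
                out.append(p.text());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Small made-up sample with an empty spacer paragraph in the middle.
        String html = "<p><span>Start: 8:30 am</span></p>"
                    + "<p><span> </span></p>"
                    + "<p><span>End: 4 pm</span></p>";
        System.out.println(toLines(html));
    }
}
```

In the transformation itself the returned value could then be assigned to an output port, for example OUT_TEXT = ParagraphLines.toLines(AUDIT_SCOPE_LOB); where OUT_TEXT is a placeholder port name.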

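For more examples of how to parse the data, take a look at jsoup's CSS selectors. For instance, if you want to parse the numbered list, you can select by class name and retrieve the second span in each list paragraph.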

for (Element span : doc.select("p.MsoNormal > span:nth-child(2)"))
    System.out.println(span.ownText());

Output:

SOP Program
Training Program
Calibration/Preventive Maintenance Program
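If you want to try that selector outside Informatica, a self-contained version might look like the sketch below. The class name ListLabels, the method labels, and the shortened SAMPLE markup (style attributes removed) are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ListLabels {
    // Shortened stand-in for the numbered paragraphs in the question.
    static final String SAMPLE =
        "<P class=MsoNormal><SPAN>1.<SPAN>   </SPAN></SPAN><SPAN>SOP Program<o:p></o:p></SPAN></P>"
      + "<P class=MsoNormal><SPAN> <o:p></o:p></SPAN></P>"
      + "<P class=MsoNormal><SPAN>2.<SPAN>   </SPAN></SPAN><SPAN>Training Program<o:p></o:p></SPAN></P>";

    // Collects the own text of the second child span of each p.MsoNormal.
    static List<String> labels(String html) {
        List<String> result = new ArrayList<String>();
        for (Element span : Jsoup.parse(html).select("p.MsoNormal > span:nth-child(2)")) {
            result.add(span.ownText());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String label : labels(SAMPLE)) {
            System.out.println(label);
        }
    }
}
```

The empty spacer paragraphs have only one child span, so span:nth-child(2) never matches them and no extra check for empty text is needed here.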