Extracting HTML text as questions and answers with jsoup

Date: 2019-02-28 11:56:05

Tags: java jsoup

I have an HTML file from which I need to extract text.


      <div>
         <p style="margin-top:4.1pt; margin-left:6pt; margin-bottom:0pt; widows:0; orphans:0; font-size:8.5pt">
         <a href="#" style="text-decoration:none">
         <span style="font-family:Verdana; text-decoration:underline; color:#11569b">ABC Company</span></a></p>
         <p style="margin-top:0.35pt; margin-bottom:0pt; widows:0; orphans:0; font-size:7pt"><span style="font-family:Verdana; -aw-import:ignore">&#xa0;</span></p>
         <ul type="disc" style="margin:0pt; padding-left:0pt">
            <li style="margin-top:4.95pt; margin-left:52.25pt; widows:0; orphans:0; padding-left:8.45pt; font-family:serif; font-size:10pt; -aw-font-family:'Symbol'; -aw-font-weight:normal; -aw-number-format:''"><span style="font-family:Verdana; font-size:8.5pt">This is abc company text</span><span style="font-family:Verdana; font-size:8.5pt; letter-spacing:-0.85pt"> </span><span style="font-family:Verdana; font-size:8.5pt">(Form)</span></li>
         </ul>
         <p style="margin-top:0.35pt; margin-bottom:0pt; widows:0; orphans:0; font-size:11pt"><span style="font-family:Verdana; -aw-import:ignore">&#xa0;</span></p>
         <p style="margin-top:0.05pt; margin-left:6pt; margin-bottom:0pt; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; font-weight:bold">Comments:</span></p>
         <p style="margin:7.65pt 12.35pt 0pt 6pt; line-height:167%; widows:0; orphans:0; font-size:10pt"><span style="font-family:Verdana; font-size:8.5pt; font-weight:bold; letter-spacing:-0.05pt">(1)</span><span style="font:7pt 'Times New Roman'"> </span><span style="font-family:Verdana; font-size:8.5pt">Sample text </span><span style="font-family:Arial; font-weight:bold; color:#ff0000">–Sample text1. ABC</span><span style="font-family:Arial; font-weight:bold; letter-spacing:-1.9pt; color:#ff0000"> </span><span style="font-family:Arial; font-weight:bold; color:#ff0000">Policy</span></p>
         <p style="margin-top:0pt; margin-left:5.95pt; margin-bottom:0pt; line-height:10.35pt; widows:0; orphans:0"><span style="font-family:Arial; font-size:10pt; font-weight:bold; color:#ff0000">ABC has been updated.</span></p>
         <p style="margin-top:0pt; margin-bottom:0pt; widows:0; orphans:0; font-size:11pt"><span style="font-family:Arial; font-weight:bold; -aw-import:ignore">&#xa0;</span></p>
         <p style="margin-top:0.05pt; margin-bottom:0pt; widows:0; orphans:0; font-size:11.5pt"><span style="font-family:Arial; font-weight:bold; -aw-import:ignore">&#xa0;</span></p>
         <p style="margin:0pt 18pt 0pt 6pt; line-height:174%; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; letter-spacing:-0.05pt">(2)</span><span style="font:7pt 'Times New Roman'; -aw-import:spaces">&#xa0; </span><span style="font-family:Verdana">ASDFFGHFGHFGHFGHFGHFGHFJGKHHJKHKHKJHJKHKJ</span><span style="font-family:Verdana; letter-spacing:-0.15pt"> </span><span style="font-family:Verdana">removed:</span></p>
         <p style="margin-top:0.1pt; margin-left:6pt; margin-bottom:0pt; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana">1. "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"</span></p>
         <p style="margin-top:7.7pt; margin-left:5.95pt; margin-bottom:0pt; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana">3. "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"</span></p>
         <p style="margin-top:7.65pt; margin-left:17.5pt; margin-bottom:0pt; text-indent:-11.6pt; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; letter-spacing:-0.05pt">3.</span><span style="font:7pt 'Times New Roman'; -aw-import:spaces">&#xa0; </span><span style="font-family:Verdana">"CCCCCCCCCCC</span><span style="font-family:Verdana; letter-spacing:-0.35pt"> </span><span style="font-family:Verdana">it"</span></p>
         <p style="margin:7.65pt 39.75pt 0pt 5.95pt; line-height:174%; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana">2.a. "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD</span></p>
         <p style="margin-top:0.1pt; margin-left:5.95pt; margin-bottom:0pt; line-height:174%; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; font-weight:bold; color:#ff0000">EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE</span></p>
         <p style="margin:0.05pt 20.95pt 0pt 5.95pt; text-indent:0pt; line-height:174%; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; font-weight:bold; letter-spacing:-0.05pt">(3)</span><span style="font:7pt 'Times New Roman'"> </span><span style="font-family:Verdana">FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF</span><span style="font-family:Verdana; letter-spacing:-2pt"> </span><span style="font-family:Verdana"> GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG </span><span style="font-family:Verdana; font-weight:bold; color:#ff0000">Definition has been</span><span style="font-family:Verdana; font-weight:bold; letter-spacing:-0.75pt; color:#ff0000"> </span><span style="font-family:Verdana; font-weight:bold; color:#ff0000">updated.</span></p>
         <p style="margin-top:0.5pt; margin-bottom:0pt; widows:0; orphans:0; font-size:14.5pt"><span style="font-family:Verdana; font-weight:bold; -aw-import:ignore">&#xa0;</span></p>
         <p style="margin:0pt 6.9pt 0pt 5.95pt; text-indent:0pt; line-height:174%; widows:0; orphans:0; font-size:11pt"><span style="font-family:Verdana; font-size:8.5pt; font-weight:bold; letter-spacing:-0.05pt">(4)</span><span style="font:7pt 'Times New Roman'"> </span><span style="font-family:Verdana; font-size:8.5pt">HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH</span><span style="font-family:Verdana; font-size:8.5pt; color:#11569b"> </span><a href="#" style="text-decoration:none"><span style="font-family:Verdana; font-size:8.5pt; text-decoration:underline; color:#11569b"></span></a><span style="font-family:Verdana; font-size:8.5pt; color:#ff0000"> </span><span style="font-family:Verdana; font-size:8.5pt; font-weight:bold; color:#ff0000">IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII</span><span style="font-family:Verdana; font-size:8.5pt; font-weight:bold; letter-spacing:-2.1pt; color:#ff0000"> </span><span style="font-family:Verdana; font-size:8.5pt; font-weight:bold; color:#ff0000">708</span></p>
         <p style="margin:0.05pt 27.8pt 0pt 5.95pt; text-indent:0pt; line-height:174%; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; font-weight:bold; letter-spacing:-0.05pt">(5)</span><span style="font:7pt 'Times New Roman'"> </span><span style="font-family:Verdana">JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ</span><span style="font-family:Verdana; letter-spacing:-2.1pt"> </span><span style="font-family:Verdana">KKKKKKKKKKKKKKKKKKKKKKKKKKK </span><span style="font-family:Verdana; font-weight:bold; color:#ff0000">– LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL</span><span style="font-family:Verdana; font-weight:bold; letter-spacing:-0.7pt; color:#ff0000"> </span><span style="font-family:Verdana; font-weight:bold; color:#ff0000">added.</span></p>
         <p style="margin:0.1pt 6.5pt 0pt 5.95pt; text-indent:0pt; line-height:174%; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; font-weight:bold; letter-spacing:-0.05pt">(6)</span><span style="font:7pt 'Times New Roman'"> </span><span style="font-family:Verdana">UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU </span><span style="font-family:Verdana; font-weight:bold; color:#ff0000">Language has been</span><span style="font-family:Verdana; font-weight:bold; letter-spacing:-0.9pt; color:#ff0000"> </span><span style="font-family:Verdana; font-weight:bold; color:#ff0000">updated.</span></p>
         <p style="margin:0.05pt 20.65pt 0pt 6pt; text-indent:-0.05pt; line-height:174%; widows:0; orphans:0; font-size:8.5pt"><span style="font-family:Verdana; letter-spacing:-0.05pt">(7)</span><span style="font:7pt 'Times New Roman'; -aw-import:spaces">&#xa0; </span><span style="font-family:Verdana">OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO,</span><span style="font-family:Verdana; letter-spacing:-1.7pt"> </span><span style="font-family:Verdana">the</span></p>
      </div>

I need output like this:

Statement | Response
ABC Company | the text that is shown in red

My jsoup code is:

public static List<PDFReaderBean> getContent(String string, List<PDFReaderBean> pdfContent) throws IOException {
    // Parse the HTML file; a null charset lets jsoup detect the encoding.
    File f = new File(string);
    Document doc = Jsoup.parse(f, null);
    Elements div = doc.select("div");
    PDFReaderBean bean = null;
    boolean qflag = false;
    boolean aflag = false;
    StringBuilder que = new StringBuilder();
    StringBuilder ans = new StringBuilder();
    List<String> boldData = new ArrayList<>(1);
    // Print the spans inside each div whose style attribute matches this exact string.
    for (Element p : div) {
        System.out.println("" + p.select("p").select("span[style=\"font-family:Arial; font-weight:bold; color:#ff0000\"]"));
    }
    return pdfContent;
}

Please suggest a good solution.

Thanks in advance.

1 Answer:

Answer 0 (score: 0)

Your code selects two elements:

<span style="font-family:Arial; font-weight:bold; color:#ff0000">–Sample text1. ABC</span>
<span style="font-family:Arial; font-weight:bold; color:#ff0000">Policy</span>
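
Only two spans match because the `[style=...]` selector compares the whole attribute value, so red spans whose style string also contains `letter-spacing` or `font-size` are skipped. The sketch below contrasts that exact match with the substring form `[style*=...]`. The class name `StyleSelectorDemo`, its methods, and the shortened HTML are made up for illustration, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StyleSelectorDemo {
    // Shortened stand-in for the question's markup: four red spans,
    // only two of which carry the exact style string used in the question.
    static final String HTML =
        "<div><p>"
        + "<span style=\"font-family:Arial; font-weight:bold; color:#ff0000\">Sample text1. ABC</span>"
        + "<span style=\"font-family:Arial; font-weight:bold; letter-spacing:-1.9pt; color:#ff0000\"> </span>"
        + "<span style=\"font-family:Arial; font-weight:bold; color:#ff0000\">Policy</span>"
        + "</p><p>"
        + "<span style=\"font-family:Arial; font-size:10pt; font-weight:bold; color:#ff0000\">ABC has been updated.</span>"
        + "</p></div>";

    // [style=value] matches only when the whole attribute value equals the given string.
    static int countExact() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("span[style=\"font-family:Arial; font-weight:bold; color:#ff0000\"]").size();
    }

    // [style*=value] matches when the attribute value contains the given substring.
    static int countContains() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("span[style*=color:#ff0000]").size();
    }

    public static void main(String[] args) {
        System.out.println("exact match:    " + countExact());
        System.out.println("contains match: " + countContains());
    }
}
```

If the goal is every red span regardless of font, the substring selector is probably the better starting point, though it also returns the whitespace-only spacer spans.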

If you only want the text of the first match, append .first().text():

System.out.println("" + p.select("p").select("span[style=\"font-family:Arial; font-weight:bold; color:#ff0000\"]").first().text());
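
Note that `first()` returns null when nothing matches, so the line above throws a `NullPointerException` for any `div` without such a span, and it only ever yields the first red fragment. As a rough sketch of one alternative, the code below collects the red text of each paragraph instead. The class `RedTextPerParagraph`, the method `redTextPerParagraph`, and the sample HTML in `main` are hypothetical; it assumes the red text is always marked with `color:#ff0000` in the `style` attribute.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RedTextPerParagraph {
    // Returns one string per <p> that contains red text; paragraphs without red spans are skipped.
    static List<String> redTextPerParagraph(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element p : doc.select("div p")) {
            StringBuilder sb = new StringBuilder();
            // Substring match on the style attribute, so the font family and size do not matter.
            for (Element span : p.select("span[style*=color:#ff0000]")) {
                String text = span.text().trim();
                if (text.isEmpty()) {
                    continue; // skip whitespace-only spacer spans
                }
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
            if (sb.length() > 0) {
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
            "<div><p><span style=\"font-family:Verdana\">Sample text </span>"
            + "<span style=\"font-family:Arial; font-weight:bold; color:#ff0000\">Sample text1. ABC</span>"
            + "<span style=\"font-family:Arial; letter-spacing:-1.9pt; color:#ff0000\"> </span>"
            + "<span style=\"font-family:Arial; font-weight:bold; color:#ff0000\">Policy</span></p>"
            + "<p><span style=\"font-family:Verdana\">No red text here</span></p></div>";
        for (String line : redTextPerParagraph(html)) {
            System.out.println(line);
        }
    }
}
```

Pairing each result with its statement text (the non-red spans of the same paragraph) would follow the same loop, but how the two columns should be split is not specified in the question, so that part is left out here.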