Jsoup: getting the value of an HTML tag

Date: 2016-01-10 00:39:20

Tags: java html parsing jsoup

I am reading an HTML file from the internet, and when I read the file, my console output looks like this:

<string>
       <String1>
        text
       </String1>
       <level2>
        text2
       </level2>
       <level3>
        text3
       </level3>
       <level4>
        text4
       </level4>
       <level5>
         TEXT
       </level5>
</string>
<string>
           <String2>
            text
           </String2>
           <level2>
            text2
           </level2>
           <level3>
            text3
           </level3>
           <level4>
            text4
           </level4>
           <level5>
             THIS TEXT
           </level5>
    </string>

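How can I access the level5 text in the second string? I have been trying all day with no luck, and I would really appreciate input from anyone who knows more about this.

Here is my code: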

String line = null;

            try {
                // FileReader reads text files in the default encoding.
                FileReader fileReader = new FileReader(String.valueOf(doc));

                // Always wrap FileReader in BufferedReader.
                BufferedReader bufferedReader = new BufferedReader(fileReader);

                while ((line = bufferedReader.readLine()) != null) {
                    Elements tdElements = doc.getElementsByTag("level1");
                    for(Element element : tdElements )
                    {
                        //Print the value of the element
                        System.out.println(element.text());
                    }

                }

                // Always close files.
                bufferedReader.close();
            } catch (FileNotFoundException ex) {
                System.out.println(
                        "Unable to open file '" +
                                doc + "'");
            } catch (IOException ex) {
                System.out.println(
                        "Error reading file '"
                                + doc + "'");
                // Or we could just do this:
                // ex.printStackTrace();
            }
        }
//
        catch (IOException e) {
            e.printStackTrace();
        }

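3 Answers:

Answer 0 (score: 1)

The code below uses Jsoup to parse the text you quoted. The variable 'textToParse' is the HTML you provided above. You can use Jsoup's pseudo-selectors to find elements at a specific position in the DOM tree; `:eq(1)` matches an element whose zero-based sibling index is 1, which here is the second string element. Hopefully this is what you want.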

Document document = Jsoup.parse(textToParse);
Elements stringTags = document.select("string:eq(1)");
for(Element e : stringTags) {
    System.out.println(e.select("level5").text());
}

//Output: THIS TEXT

Answer 1 (score: 1)

You can use a CSS selector here:

string:nth-of-type(2) > level5

DEMO: http://try.jsoup.org/~8w_pfCxDhJwIseTKiKsQjQJOBRs

Description

string:nth-of-type(2) /* Select the 2nd string node in document... */
> level5                /* ... then select all "level5" child nodes  */

Sample code

Document doc = ...
Element level5Node = doc.select("string:nth-of-type(2) > level5").first();
if (level5Node ==null) {
   throw new RuntimeException("Unable to locate level5 text...");
}

System.out.println(level5Node.text()); // THIS TEXT

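Answer 2 (score: 0)

Solution 1: Your HTML is well-formed XML once it is wrapped in a single root element, so you can use XML tools:

You can use XPath to get the second level5: "//string[2]/level5"

Solution 2: Parse it with Jsoup to get a document, then use XPath as in Solution 1.

For using XPath / XSoup with Jsoup, see: Does jsoup support xpath?

Solution 1: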

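import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// yourXml is a placeholder for the markup from the question; <root> gives it a single parent element.
String xml = "<root>" + yourXml + "</root>";
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(xml)));
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//string[2]/level5";
String value = xPath.evaluate(expression, document).trim(); // trim() drops the surrounding whitespace
System.out.println("EVALUATE:" + value);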
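
The XPath approach in Solution 1 can be sketched as a small self-contained class that uses only the JDK. The class name `SecondLevel5`, the helper `extract`, and the shortened sample markup are hypothetical and chosen for illustration; the sketch assumes the input becomes well-formed XML once wrapped in a `<root>` element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SecondLevel5 {

    // Hypothetical helper: returns the trimmed text of <level5> inside the
    // second <string> element, or an empty string if there is no such element.
    static String extract(String markup) {
        try {
            // A single root element is needed because the markup has
            // several top-level <string> elements.
            String xml = "<root>" + markup + "</root>";
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document =
                    builder.parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the
            // first node matched by the expression.
            return xPath.evaluate("//string[2]/level5", document).trim();
        } catch (Exception e) {
            // Parsing and XPath errors are rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the markup in the question.
        String markup =
                "<string><String1> text </String1><level5> TEXT </level5></string>"
              + "<string><String2> text </String2><level5> THIS TEXT </level5></string>";
        System.out.println(extract(markup)); // prints: THIS TEXT
    }
}
```

Note that XPath positions are 1-based, so `string[2]` is the second element, whereas Jsoup's `:eq(1)` in the first answer is 0-based.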