How to unescape the tags without unescaping the content

Time: 2013-12-30 08:45:58

Tags: java xml json escaping

How can I unescape only the tags and not the content? Let me explain with an example...

This is the original raw response:

<GetWhoISResponse xmlns="http://www.webservicex.net">
         <GetWhoISResult>Whois Server Version 2.0

To single out one record, look it up with "xxx", where xxx is one of the
of the records displayed above. If the records are the same, look them up
with "=xxx" to receive a full display for each record.

>>> Last update of whois database: Mon, 30 Dec 2013 08:20:00 UTC <<<

NOTICE: The expiration date displayed in this record is the date the 
registrar's sponsorship of the domain name registration in the registry is 
currently set to expire. This date does not necessarily reflect the expiration 
date of the domain name registrant's agreement with the sponsoring 
registrar.  Users may consult the sponsoring registrar's Whois database to 
view the registrar's reported date of expiration for this registration.

</GetWhoISResult>
      </GetWhoISResponse>

If I use StringEscapeUtils to unescape the text (unescapeXml):

<GetWhoISResponse xmlns="http://www.webservicex.net">
    <GetWhoISResult>Whois Server Version 2.0

To single out one record, look it up with "xxx", where xxx is one of the
of the records displayed above. If the records are the same, look them up
with "=xxx" to receive a full display for each record.

>>> Last update of whois database: Mon, 30 Dec 2013 08:20:00 UTC <<<

NOTICE: The expiration date displayed in this record is the date the 
registrar's sponsorship of the domain name registration in the registry is 
currently set to expire. This date does not necessarily reflect the expiration 
date of the domain name registrant's agreement with the sponsoring 
registrar.  Users may consult the sponsoring registrar's Whois database to 
view the registrar's reported date of expiration for this registration.

    </GetWhoISResult>
</GetWhoISResponse>

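The problem is in the middle, on the line where < and > had been escaped (the one starting with >>> Last update). I need this because I want to convert the response to JSON, but now I get a parse error.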

2 answers:

Answer 0: (score: 1)

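This is an interesting problem. I tried lenient XML parsers, but none of them seemed able to parse the broken XML. The next best option is a regular expression. I managed to repair the given XML with one, on the assumption that the less-than and greater-than signs in the content do not form a tag-like pattern such as: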

< some random text here and >

After some research, I settled on two regex patterns for the given XML (they can also be used in a more general form):

public static final String LESSER_STRING = "<(.[^>]*)(<)+";
public static final String GREATER_STRING = ">[^<](.[^<]*)(>)+";

These strings are compiled into regex patterns, and the resulting matchers scan the sequence.
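To see what each pattern picks out, here is a minimal sketch that runs both against a shortened stand-in for the response. The class name PatternDemo, the helper firstMatch and the sample string are made up for illustration; only the two pattern strings come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternDemo {
    // The two patterns from the answer, unchanged.
    static final String LESSER_STRING = "<(.[^>]*)(<)+";
    static final String GREATER_STRING = ">[^<](.[^<]*)(>)+";

    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the unescaped response (an assumption).
        String sample = "<a><b>text >>> note <<< more</b></a>";

        // Starts at the '>' closing the last opening tag and ends at the
        // last stray '>' in the content.
        System.out.println(firstMatch(GREATER_STRING, sample)); // prints ">text >>>"

        // Starts at the first stray '<' and ends at the '<' that opens the
        // next real tag.
        System.out.println(firstMatch(LESSER_STRING, sample)); // prints "<<< more<"
    }
}
```

In both cases the match includes one character of real markup (the leading > or the trailing <), which is why the full code has to put that character back unescaped.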

Here is the working code, followed by its output:

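    // Needs java.util.regex.Matcher, java.util.regex.Pattern and
    // StringEscapeUtils from Apache Commons Lang.
    public static final String LESSER_STRING = "<(.[^>]*)(<)+";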
    public static final String GREATER_STRING = ">[^<](.[^<]*)(>)+";
    public static final String ESCAPED_XML = "&lt;GetWhoISResponse xmlns=&quot;http://www.webservicex.net&quot;&gt;&lt;GetWhoISResult&gt;Whois Server Version 2.0 To single out one record, look it up with &quot;xxx&quot;, where xxx is one of the of the records displayed above. If the records are the same, look them up with &quot;=xxx&quot; to receive a full display for each record. &gt;&gt;&gt; Last update of whois database: Mon, 30 Dec 2013 08:20:00 UTC &lt;&lt;&lt; NOTICE: The expiration date displayed in this record is the date the registrar&apos;s sponsorship of the domain name registration in the registry is currently set to expire. This date does not necessarily reflect the expiration date of the domain name registrant&apos;s agreement with the sponsoring registrar.  Users may consult the sponsoring registrar&apos;s Whois database to view the registrar&apos;s reported date of expiration for this registration.&lt;/GetWhoISResult&gt;&lt;/GetWhoISResponse&gt;";
    private static Matcher matcher;
    private static Pattern pattern;
    private static String alter;
    private static StringBuffer str = new StringBuffer();
    private static StringBuffer jsonString = new StringBuffer();

    public static void main(String[] args) {
        String xml = StringEscapeUtils.unescapeXml(ESCAPED_XML);

        pattern = Pattern.compile(GREATER_STRING);
        matcher = pattern.matcher(xml);

        while (matcher.find()) {
            System.out.println(matcher.group(0));
            System.out.println(matcher.group(0).substring(1));

            // Find the first encountered greater-than sign, assuming
            // greater-than and less-than signs do not form a 'tag' pattern

            // Picks the first value after the 'last opened tag' including the
            // greater sign - take substring 1
            alter = ">" + matcher.group(0).substring(1).replaceAll(">", "&gt;");
            // quoteReplacement keeps any '$' or '\' in the text literal
            matcher.appendReplacement(str, Matcher.quoteReplacement(alter));
        }

        matcher.appendTail(str);

        pattern = Pattern.compile(LESSER_STRING);
        matcher = pattern.matcher(str);

        while (matcher.find()) {
            System.out.println(matcher.group(0));
            System.out.println(matcher.group(0).substring(0,
                    matcher.group(0).length() - 1));

            // Find the encountered lesser than sign assuming greater
            // than and less than do not form a 'tag' pattern

            // Picks the content between the lesser tags and the last opened
            // tag; including the lesser sign of the tag
            // Reduce it by 1 to prevent the last tag getting replaced
            alter = matcher.group(0)
                    .substring(0, matcher.group(0).length() - 1);

            // Add the last tag as is without replacing
            alter = alter.replaceAll("<", "&lt;") + "<";
            // quoteReplacement keeps any '$' or '\' in the text literal
            matcher.appendReplacement(jsonString, Matcher.quoteReplacement(alter));

        }

        matcher.appendTail(jsonString);

        System.out.println(jsonString);
    }

Output:

<GetWhoISResponse xmlns="http://www.webservicex.net"><GetWhoISResult>Whois Server Version 2.0 To single out one record, look it up with "xxx", where xxx is one of the of the records displayed above. If the records are the same, look them up with "=xxx" to receive a full display for each record. &gt;&gt;&gt; Last update of whois database: Mon, 30 Dec 2013 08:20:00 UTC &lt;&lt;&lt; NOTICE: The expiration date displayed in this record is the date the registrar's sponsorship of the domain name registration in the registry is currently set to expire. This date does not necessarily reflect the expiration date of the domain name registrant's agreement with the sponsoring registrar.  Users may consult the sponsoring registrar's Whois database to view the registrar's reported date of expiration for this registration.</GetWhoISResult></GetWhoISResponse>
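One way to check that a repaired string like this is well-formed again is to feed it to a standard XML parser. The sketch below assumes the JDK's built-in DOM parser; the names ParseCheck and resultText are hypothetical, and the sample is a shortened version of the repaired response rather than the full text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseCheck {
    // Hypothetical helper: parses xml and returns the text of the first
    // GetWhoISResult element. Fails if the input is not well-formed.
    static String resultText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace, so the default xmlns does not matter.
            return doc.getElementsByTagNameNS("*", "GetWhoISResult")
                    .item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML is still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the repaired response (an assumption).
        String repaired = "<GetWhoISResponse xmlns=\"http://www.webservicex.net\">"
                + "<GetWhoISResult>a &gt;&gt;&gt; b &lt;&lt;&lt; c</GetWhoISResult>"
                + "</GetWhoISResponse>";
        // The parser unescapes the entities only inside the text content.
        System.out.println(resultText(repaired)); // prints "a >>> b <<< c"
    }
}
```

Once the document parses, the text returned by the parser is the plain content, which can then be handed to whatever JSON library is in use.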

Answer 1: (score: 0)

You can read the content and escape "<" and ">" as "&lt;" and "&gt;" again.
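A minimal sketch of that idea, assuming the element that holds the free text is known in advance, appears once, has no attributes, and contains no child elements. The class ReEscapeContent and the method reEscape are hypothetical names, and the sample input is shortened.

```java
public class ReEscapeContent {
    // Hypothetical helper: re-escapes the text between <tag> and </tag>.
    static String reEscape(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = xml.indexOf(open);
        int end = xml.lastIndexOf(close);
        if (start < 0 || end < start + open.length()) {
            return xml; // element not found: leave the input untouched
        }
        start += open.length();
        // '&' goes first so the entities added afterwards are not escaped twice.
        String content = xml.substring(start, end)
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        return xml.substring(0, start) + content + xml.substring(end);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the unescaped response (an assumption).
        String broken = "<GetWhoISResponse xmlns=\"http://www.webservicex.net\">"
                + "<GetWhoISResult>text >>> Last update <<< more text</GetWhoISResult>"
                + "</GetWhoISResponse>";
        System.out.println(reEscape(broken, "GetWhoISResult"));
    }
}
```

Compared with the regex approach, this depends on knowing the element name, but it does not rely on the content being free of tag-like patterns.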