Parsing an HTTP XML response with regular expressions in Java

Date: 2017-07-14 12:30:44

Tags: java xml parsing

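I'm making an API call, and now I need to extract one specific piece of data from the response. I need the DocumentID of the document whose Description is "Invoice", which in the case below is 110107.

I've already written a method that gets the data for a single tag, like this: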

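public synchronized String getTagFromHTTPResponseAsString(String tag, String body) throws IOException {

    // Lazily capture the text between the first <tag>...</tag> pair
    final Pattern pattern = Pattern.compile("<" + tag + ">(.+?)</" + tag + ">");
    final Matcher matcher = pattern.matcher(body);

    // group(1) throws IllegalStateException unless find() has succeeded
    return matcher.find() ? matcher.group(1) : null;

} // end getTagFromHTTPResponseAsString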

My problem is that in this result set several fields share the same tag, and I need one specific field. Here is the response:

<?xml version="1.0" encoding="utf-8"?>
<Order TrackingID="351535" TrackingNumber="TEST-843245" xmlns="">
  <ErrorMessage />
  <StatusDocuments>
    <StatusDocument NUM="1">
      <DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
      <FileName>4215.pdf</FileName>
      <Type>Sales Contract</Type>
      <Description>Uploaded Document</Description>
      <DocumentID>110098</DocumentID>
      <DocumentPlaceHolder />
    </StatusDocument>
    <StatusDocument NUM="2">
      <DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
      <FileName>Apex_Shortcuts.pdf</FileName>
      <Type>Other</Type>
      <Description>Uploaded Document</Description>
      <DocumentID>110100</DocumentID>
      <DocumentPlaceHolder />
    </StatusDocument>
    <StatusDocument NUM="3">
      <DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
      <FileName>CRAddend.pdf</FileName>
      <Type>Other</Type>
      <Description>Uploaded Document</Description>
      <DocumentID>110104</DocumentID>
      <DocumentPlaceHolder />
    </StatusDocument>
    <StatusDocument NUM="4">
      <DocumentDate>7/14/2017 6:52:00 AM</DocumentDate>
      <FileName>test.pdf</FileName>
      <Type>Other</Type>
      <Description>Uploaded Document</Description>
      <DocumentID>110102</DocumentID>
      <DocumentPlaceHolder />
    </StatusDocument>
    <StatusDocument NUM="5">
      <DocumentDate>7/14/2017 6:55:00 AM</DocumentDate>
      <FileName>Invoice.pdf</FileName>
      <Type>Invoice</Type>
      <Description>Invoice</Description>
      <DocumentID>110107</DocumentID>
      <DocumentPlaceHolder />
    </StatusDocument>
  </StatusDocuments>
</Order>

I built and tested my regular expression on https://regex101.com/ and got this regex working there, but I can't translate it correctly into my Java code:

<Description>Invoice<\/Description>
      <DocumentID>(.*?)<\/DocumentID>

2 answers:

Answer 0: (score: 1)

Try using Jsoup.

Example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class sssaa {
    public static void main(String[] args) throws Exception {
        String xml = "yourXML"; // replace with the XML response body
        Document doc = Jsoup.parse(xml);
        // Select every StatusDocument element in the response
        Elements statusDocuments = doc.select("StatusDocument");
        for (Element e : statusDocuments) {
            // Print the DocumentID only for the entry described as "Invoice"
            if (e.select("Description").text().equals("Invoice")) {
                System.out.println(e.select("DocumentID").text());
            }
        }
    }
}
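
If adding the Jsoup dependency is not an option, the same lookup can probably be done with the XPath API that ships with the JDK. The following is only a sketch and not part of the answer above: the class name InvoiceIdXPath, the helper findInvoiceDocumentId, and the shortened sample XML are all assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name is not from the original post.
class InvoiceIdXPath {

    // Returns the DocumentID of the first StatusDocument whose Description
    // is exactly "Invoice", or an empty string when there is none.
    static String findInvoiceDocumentId(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//StatusDocument[Description='Invoice']/DocumentID";
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            // Raised for malformed XML as well as for a bad expression
            throw new IllegalArgumentException("Could not evaluate XPath on response", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the response shown in the question
        String xml =
              "<Order><StatusDocuments>"
            + "<StatusDocument NUM=\"1\"><Description>Uploaded Document</Description>"
            + "<DocumentID>110098</DocumentID></StatusDocument>"
            + "<StatusDocument NUM=\"5\"><Description>Invoice</Description>"
            + "<DocumentID>110107</DocumentID></StatusDocument>"
            + "</StatusDocuments></Order>";
        System.out.println(findInvoiceDocumentId(xml)); // prints 110107
    }
}
```

Because the predicate tests the Description child of each StatusDocument, the order of the elements and the whitespace between them do not matter, which is the main advantage over matching the raw text.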

Answer 1: (score: 0)

What I did to solve this was to use a StringBuilder to turn the response into a single string, and then use this code to get the DocumentID:

// Create the pattern and matcher; (.*?) is lazy so it stops at the first </DocumentID>
Pattern p = Pattern.compile("<Description>Invoice</Description><DocumentID>(.*?)</DocumentID>");
Matcher m = p.matcher(responseText);

// If an occurrence of the pattern was found in the given string...
if (m.find()) {
    // ...then you can use the group() methods.
    System.out.println("group0 = " + m.group(0)); // whole matched expression
    System.out.println("group1 = " + m.group(1)); // contents of the first capturing group

    // Set the documentID for the Invoice (group() is only valid after a successful find())
    documentID = m.group(1);
}

This may not be the best way to do it, but it works for now. I'll come back and try to clean it up using the more correct solution given here.
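
The pattern in this answer expects the two tags to sit directly next to each other, which is likely why the response had to be flattened first: in the raw response a newline and indentation separate </Description> from <DocumentID>. A minimal sketch of a whitespace-tolerant variant follows, assuming the response is already available as a String. The class name InvoiceIdRegex, the helper findInvoiceDocumentId, and the shortened sample are hypothetical and not from the original post.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class InvoiceIdRegex {

    // Java string form of the regex101 pattern. \\s* absorbs the line break
    // and indentation between the two elements, and the lazy (.*?) stops at
    // the first closing </DocumentID> tag.
    private static final Pattern INVOICE_ID = Pattern.compile(
            "<Description>Invoice</Description>\\s*<DocumentID>(.*?)</DocumentID>");

    // Returns the captured DocumentID, or an empty Optional when nothing matches.
    static Optional<String> findInvoiceDocumentId(String responseText) {
        Matcher m = INVOICE_ID.matcher(responseText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the response, keeping its line breaks
        String response = String.join("\n",
                "    <StatusDocument NUM=\"4\">",
                "      <Description>Uploaded Document</Description>",
                "      <DocumentID>110102</DocumentID>",
                "    </StatusDocument>",
                "    <StatusDocument NUM=\"5\">",
                "      <Description>Invoice</Description>",
                "      <DocumentID>110107</DocumentID>",
                "    </StatusDocument>");
        System.out.println(findInvoiceDocumentId(response).orElse("not found")); // prints 110107
    }
}
```

Returning an Optional makes the no-match case explicit to the caller. The pattern still depends on DocumentID immediately following Description, so it would break if the service reordered those elements.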