I have a large XML file (~1 GB) with this structure:
<?xml version="1.0" encoding="UTF-8"?>
<GenoExchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/geno" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/geno ftp://ftp.ncbi.nlm.nih.gov/snp/specs/genoex_1_5.xsd" dbSNPBuildNo="146" reportId="MT" reportType="chromosome">
<Population popId="638" handle="TSC-CSHL" locPopId="TSC_42_AA">
<popClass self="NORTH AMERICA"/>
</Population>
<SnpInfo rsId="1041870" observed="C/T">
<SnpLoc genomicAssembly="107:GRCh38.p2" geneId="4512" geneSymbol="COX1" chrom="MT" start="6150" locType="2" rsOrientToChrom="fwd" contigAllele="T" contig="NC_012920:1"/>
<SsInfo ssId="1508548" locSnpId="TSC0349089" ssOrientToRs="fwd">
<ByPop popId="1303" sampleSize="184">
<AlleleFreq allele="T" freq="1"/>
<AlleleFreq allele="C" freq="0"/>
</ByPop>
</SsInfo>
</SnpInfo>
<SnpInfo rsId="1029293" observed="C/T">
<SnpLoc genomicAssembly="107:GRCh38.p2" geneId="4512" geneSymbol="COX1" chrom="MT" start="6307" locType="2" rsOrientToChrom="fwd" contigAllele="C" contig="NC_012920:1"/>
<SsInfo ssId="1494519" locSnpId="TSC0254145" ssOrientToRs="fwd">
<ByPop popId="639" sampleSize="82">
<AlleleFreq allele="T" freq="0"/>
<AlleleFreq allele="C" freq="1"/>
</ByPop>
<ByPop popId="1303" sampleSize="184">
<AlleleFreq allele="T" freq="0"/>
<AlleleFreq allele="C" freq="1"/>
</ByPop>
</SsInfo>
</SnpInfo>
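I want to find one specific rsId, for example rsId="1029293", and extract all the information inside that node. I don't want to run through the whole file; I just want to find that ID, extract its information, and end the iteration. From what I've read, a SAX or StAX parser should be the better choice. I'm using SAX, and this is my code: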
class UserHandler extends DefaultHandler {
String rsID = null;
String i = "1029293";
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("SnpInfo")) {
rsID = attributes.getValue("rsId");
//System.out.println("value: " + rsID);
}
if((i).equals(rsID) &&
qName.equalsIgnoreCase("SnpInfo")){
System.out.println("Start Element: " + qName + " " + rsID);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("SsInfo")) {
String a = attributes.getValue("ssId");
System.out.println("SSID: " + a);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("ByPop")) {
String p = attributes.getValue("popId");
System.out.println("POPID: " + p);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("AlleleFreq")) {
String p = attributes.getValue("allele");
String f = attributes.getValue("freq");
System.out.println("ALLELE: " + p + " FREQ: " + f);
}
if ((i).equals(rsID) && qName.equalsIgnoreCase("GTypeFreq")) {
String p = attributes.getValue("gtype");
String f = attributes.getValue("freq");
System.out.println("GTYPE: " + p + " FREQ: " + f);
}
}
@Override
public void endElement(String uri,
String localName, String qName) throws SAXException {
if (qName.equalsIgnoreCase("SnpInfo")) {
if((i).equals(rsID)
&& qName.equalsIgnoreCase("SnpInfo"))
System.out.println("End Element: " + qName);
}
}
}
public class XMLParser {
public static void main(String argv[]) {
try {
InputStream fileStream = new FileInputStream("/home/xml/gt_chr10.xml.gz");
InputStream gzipStream = new GZIPInputStream(fileStream);
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
UserHandler userhandler = new UserHandler();
saxParser.parse(gzipStream, userhandler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
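My problem is that my code searches the whole file for the ID, and every run takes more than 2 minutes. I can't afford code that runs that long. Is there a better way to do this?
Answer 0 (score: 1)
You can throw an exception in the end-element handler to tell the parser to abort parsing (http://www.ibm.com/developerworks/library/x-tipsaxstop/):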
@Override
public void endElement(String uri,
        String localName, String qName) throws SAXException {
    // Abort only once the matching SnpInfo element has been read completely.
    if (qName.equalsIgnoreCase("SnpInfo") && i.equals(rsID)) {
        System.out.println("End Element: " + qName);
        // Catch this around saxParser.parse(...) and treat it as a normal stop.
        throw new SAXException("Element found.");
    }
}
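To make the abort-by-exception idea concrete, here is a minimal, self-contained sketch. The class names EarlyExitSaxDemo, StopParsingException and SnpHandler, the helper findAlleleFreqs, and the tiny inline sample document are all hypothetical and only serve the illustration; the element and attribute names come from the question. A dedicated exception subclass is one possible way to tell the intentional stop apart from a real parse error.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EarlyExitSaxDemo {

    /** Hypothetical marker exception: means "target processed", not a parse error. */
    static final class StopParsingException extends SAXException {
        StopParsingException() {
            super("Target element processed; stopping.");
        }
    }

    /** Collects allele/freq pairs of one SnpInfo element, then aborts the parse. */
    static final class SnpHandler extends DefaultHandler {
        private final String targetRsId;
        private boolean inTarget = false;
        final StringBuilder out = new StringBuilder();

        SnpHandler(String targetRsId) {
            this.targetRsId = targetRsId;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("SnpInfo".equals(qName)) {
                // Re-evaluate on every SnpInfo so state never leaks between records.
                inTarget = targetRsId.equals(atts.getValue("rsId"));
            } else if (inTarget && "AlleleFreq".equals(qName)) {
                out.append(atts.getValue("allele")).append('=')
                   .append(atts.getValue("freq")).append(';');
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inTarget && "SnpInfo".equals(qName)) {
                throw new StopParsingException(); // the rest of the file is never read
            }
        }
    }

    static String findAlleleFreqs(InputStream in, String rsId)
            throws IOException, SAXException, ParserConfigurationException {
        SnpHandler handler = new SnpHandler(rsId);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (StopParsingException expected) {
            // Intentional early exit; any other SAXException still propagates.
        }
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for the real file (assumption: same element/attribute names).
        String xml = "<GenoExchange>"
                + "<SnpInfo rsId=\"1041870\"><AlleleFreq allele=\"T\" freq=\"1\"/></SnpInfo>"
                + "<SnpInfo rsId=\"1029293\"><AlleleFreq allele=\"C\" freq=\"1\"/></SnpInfo>"
                + "</GenoExchange>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(findAlleleFreqs(in, "1029293"));
    }
}
```

Note that this only saves time when the target appears before the end of the file; if the rsId sits near the end, or is missing, the whole document is still scanned.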
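Answer 1 (score: 1)
StAX gives you more control while parsing the XML, because you actively pull elements from the stream. You pull the next event, process it, and once you have found your data you simply terminate the loop (with a flag or, if need be, even a return statement).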
InputStream in = ...
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader eventReader = factory.createXMLEventReader(in);
boolean found = false;
while (!found && eventReader.hasNext()) {
XMLEvent event = eventReader.nextEvent();
switch (event.getEventType()) {
case XMLStreamConstants.START_ELEMENT:
// your logic here
// once you found your element, you can terminate the loop
found = true;
break;
case XMLStreamConstants.END_ELEMENT:
// your logic here
break;
}
}
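(Exception and resource handling omitted for brevity.)
Also, you can merge the repeated if ((i).equals(rsID) && ...) checks into a single outer check and nest the element-specific ifs inside it: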
if ((i).equals(rsID)) {
if(qName.equalsIgnoreCase("GTypeFreq")) {
...
}
}
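As a fuller illustration of the StAX approach described in this answer, the following sketch uses the cursor-style XMLStreamReader (a sibling of the XMLEventReader shown above) and leaves the loop as soon as the matching SnpInfo element closes. The class name StaxEarlyExitDemo, the helper findAlleleFreqs, and the inline sample document are assumptions made for the example; only the element and attribute names are taken from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEarlyExitDemo {

    /** Returns "allele=freq" entries for the SnpInfo with the given rsId, or an empty list. */
    static List<String> findAlleleFreqs(InputStream in, String targetRsId)
            throws XMLStreamException {
        List<String> result = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            boolean inTarget = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("SnpInfo".equals(name)) {
                        // Passing null as the namespace means "do not check the namespace".
                        inTarget = targetRsId.equals(reader.getAttributeValue(null, "rsId"));
                    } else if (inTarget && "AlleleFreq".equals(name)) {
                        result.add(reader.getAttributeValue(null, "allele")
                                + "=" + reader.getAttributeValue(null, "freq"));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && inTarget && "SnpInfo".equals(reader.getLocalName())) {
                    break; // target fully read: stop pulling events
                }
            }
        } finally {
            reader.close(); // frees parser resources; does not close the InputStream
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for the real file (assumption: same element/attribute names).
        String xml = "<GenoExchange xmlns=\"http://www.ncbi.nlm.nih.gov/SNP/geno\">"
                + "<SnpInfo rsId=\"1041870\"><AlleleFreq allele=\"T\" freq=\"1\"/></SnpInfo>"
                + "<SnpInfo rsId=\"1029293\"><AlleleFreq allele=\"C\" freq=\"1\"/></SnpInfo>"
                + "</GenoExchange>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(findAlleleFreqs(in, "1029293")); // prints [C=1]
    }
}
```

For the gzipped file from the question, the same method could be given a GZIPInputStream; because events are pulled on demand, decompression also stops once the loop exits.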
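Answer 2 (score: 1)
The only way to avoid parsing the whole file on every run is to put the data into an XML database. Parsing a 1 GB file will take about a minute, give or take, depending on the speed of your machine and how much processing you do on each node.
A streaming XSLT 3.0 solution is simple: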
<xsl:transform version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xpath-default-namespace="http://www.ncbi.nlm.nih.gov/SNP/geno">
<xsl:template name="xsl:initial-template">
<xsl:stream href="input.xml">
<xsl:copy-of select="/GenoExchange/SnpInfo[@rsId='1041870'][1]"/>
</xsl:stream>
</xsl:template>
</xsl:transform>
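There is no need to write all that tedious SAX or StAX code.
I added the "[1]" predicate so that the processor can abandon the search as soon as it finds the first match.
Answer 3 (score: 1)
The best approach is to use vtd-xml with XPath: a 1 GB XML file needs about 1.5 GB of heap space and takes under 10 seconds on a 3-4 year old Intel processor. See the code sample below. One more thing: if you want to eliminate parsing altogether, you can create a VTD+XML file, so that any subsequent query can access the VTD index directly, which can easily improve your application's performance by a factor of three or four.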
import com.ximpleware.*;
public class simpleXpathSearch {
    public static void main(String s[]) throws VTDException, java.io.UnsupportedEncodingException, java.io.IOException {
        VTDGen vg = new VTDGen();
        vg.setLCLevel(5);
        if (!vg.parseFile("input.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        // Attribute names are case-sensitive: the file uses rsId, not rsID.
        ap.selectXPath("/*/*[@rsId='1029293']");
        int i = 0;
        while ((i = ap.evalXPath()) != -1) {
            // your code logic here
        }
    }
}
Answer 4 (score: 0)
// Main class
public static void main(String[] args) {
SAXReader.read();
}
// SAXReader class
public static void read(){
try {
XMLReader processor = XMLReaderFactory.createXMLReader();
processor.setContentHandler(new SAXController());
processor.parse(new InputSource("MyXML.xml"));
} catch (SAXException | IOException e) {
System.err.println(e.getMessage());
}
}
// SAXController
// SAXController extends DefaultHandler
private int tab = 0;
private void tabulation() {
for (int i=0; i<tab; i++)
System.out.print(" ");
}
@Override
public void startDocument() {
tabulation();
System.out.println("Starting XML Document");
tab++;
}
@Override
public void endDocument() {
tab--;
tabulation();
System.out.println("Ending XML Document");
}
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes)
throws SAXException {
tabulation();
System.out.print(localName);
if (attributes.getLength()>0) {
for (int i=0; i<attributes.getLength(); i++) {
System.out.print(attributes.getLocalName(i)+": "+attributes.getValue(i));
}
}
System.out.println();
tab++;
}
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
tab--;
tabulation();
System.out.println(localName);
}
@Override
public void characters(char[] ch, int start, int length)
throws SAXException {
String content= new String(ch, start, length);
content= content.replaceAll("[\t\n]", "").trim();
if (!content.equals("")) {
tabulation();
System.out.println(content);
}
}