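I've seen many posts, blogs, and articles about splitting an XML file into smaller chunks, and decided to write my own splitter because I have some custom requirements. Here is what I mean. Consider the following XML: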
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
<staff id="1">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="2">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="3">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="4">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="5">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<salary>100000</salary>
</staff>
</company>
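I want to split this XML into n parts, each part written to its own file, but a staff element must contain a nickname; if it doesn't, I don't want it. So this should produce 4 XML splits, one for each of the staff ids 1 through 4.
Here is my code: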
public int split() throws Exception{
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFilePath)));
String line;
List<String> tempList = null;
while((line=br.readLine())!=null){
if(line.contains("<?xml version=\"1.0\"") || line.contains("<" + rootElement + ">") || line.contains("</" + rootElement + ">")){
continue;
}
if(line.contains("<"+ element +">")){
tempList = new ArrayList<String>();
}
tempList.add(line);
if(line.contains("</"+ element +">")){
if(hasConditions(tempList)){
writeToSplitFile(tempList);
writtenObjectCounter++;
totalCounter++;
}
}
if(writtenObjectCounter == itemsPerFile){
writtenObjectCounter = 0;
fileCounter++;
tempList.clear();
}
}
if(tempList.size() != 0){
writeClosingRootElement();
}
return totalCounter;
}
private void writeToSplitFile(List<String> itemList) throws Exception{
BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
if(writtenObjectCounter == 0){
wr.write("<" + rootElement + ">");
wr.write("\n");
}
for (String string : itemList) {
wr.write(string);
wr.write("\n");
}
if(writtenObjectCounter == itemsPerFile-1)
wr.write("</" + rootElement + ">");
wr.close();
}
private void writeClosingRootElement() throws Exception{
BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
wr.write("</" + rootElement + ">");
wr.close();
}
private boolean hasConditions(List<String> list){
int matchList = 0;
for (String condition : conditionList) {
for (String string : list) {
if(string.contains(condition)){
matchList++;
}
}
}
if(matchList >= conditionList.size()){
return true;
}
return false;
}
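I know that opening and closing the stream for every staff element written does hurt performance, as opposed to writing once per file (which may contain n staff elements). Naturally, the root and split elements are configurable.
Any ideas on how I can improve the performance/logic? I'd prefer some code, but good advice can sometimes be better.
Edit:
This XML example is actually a dummy. The real XML I'm trying to split has about 300-500 different elements under the split element, all appearing in random order and in varying numbers. Maybe StAX isn't the best solution after all?
Bounty update:
I'm looking for a solution (code) that will:
Be able to split an XML file into n parts with x split elements each (in the dummy XML example, staff is the split element).
Wrap the content of the split files in the root element of the original file (company in the dummy example).
Let me specify a condition that must hold inside the split element, i.e. I want only staff that have a nickname and want to discard those without one. It should also be able to split unconditionally when run without conditions.
The code doesn't have to be an improvement on my solution (which lacks good logic and performance, but works).
And I'm not happy with "but it works". I also can't find enough examples of StAX for this kind of operation, and the user community isn't great either. It doesn't have to be a StAX solution.
I'm probably asking too much, but I'm here to learn, and I'm offering a good bounty for what I consider the solution.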
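Answer 0 (score: 20)
First piece of advice: don't try to write your own XML handling code. Use an XML parser. It will be much more reliable and quite possibly faster.
If you use an XML pull parser (e.g. StAX), you should be able to read one element at a time and write it out to disk, never reading the whole document in one go.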
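As a rough illustration of the pull-parser approach, here is a minimal sketch that also covers the "x split elements per file" requirement from the question. It assumes that the split elements are the direct children of the root element and that the condition is "has a direct child with a given name". The class name `StaxBatchSplitter`, the `split` method, its parameters, and the `split_N.xml` file naming are all hypothetical, and error handling is kept to a minimum.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

final class StaxBatchSplitter {

    /**
     * Writes the children of the root element into files holding at most
     * itemsPerFile elements each, wrapped in the original root element.
     * If requiredChild is non-null, elements without such a direct child are dropped.
     */
    static List<Path> split(Reader input, Path outDir, int itemsPerFile, String requiredChild)
            throws XMLStreamException, IOException {
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(input);
        List<Path> files = new ArrayList<>();
        Writer out = null;
        XMLEventWriter writer = null;
        StartElement root = null;
        int inCurrentFile = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!event.isStartElement()) {
                continue; // whitespace, comments, the root end tag, ...
            }
            if (root == null) {
                root = event.asStartElement(); // kept so it can wrap every part
                continue;
            }
            // Buffer exactly one split element; depth is relative to that element.
            List<XMLEvent> buffer = new ArrayList<>();
            buffer.add(event);
            boolean keep = (requiredChild == null);
            int depth = 1;
            while (depth > 0) {
                XMLEvent inner = reader.nextEvent();
                buffer.add(inner);
                if (inner.isStartElement()) {
                    depth++;
                    String name = inner.asStartElement().getName().getLocalPart();
                    if (depth == 2 && name.equals(requiredChild)) {
                        keep = true;
                    }
                } else if (inner.isEndElement()) {
                    depth--;
                }
            }
            if (!keep) {
                continue;
            }
            if (writer == null) { // open each output file once, not once per element
                Path file = outDir.resolve("split_" + files.size() + ".xml");
                files.add(file);
                out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
                writer = outputFactory.createXMLEventWriter(out);
                writer.add(eventFactory.createStartDocument("UTF-8", "1.0"));
                writer.add(root);
            }
            for (XMLEvent buffered : buffer) {
                writer.add(buffered);
            }
            if (++inCurrentFile == itemsPerFile) {
                closePart(writer, out, eventFactory, root);
                writer = null;
                inCurrentFile = 0;
            }
        }
        if (writer != null) {
            closePart(writer, out, eventFactory, root); // last, partially filled file
        }
        reader.close();
        return files;
    }

    private static void closePart(XMLEventWriter writer, Writer out,
                                  XMLEventFactory eventFactory, StartElement root)
            throws XMLStreamException, IOException {
        writer.add(eventFactory.createEndElement(root.getName(), null));
        writer.add(eventFactory.createEndDocument());
        writer.close(); // flushes the event writer
        out.close();    // closes the underlying file
    }
}
```

Only one split element is held in memory at a time, and each output file is opened and closed once. Passing null as `requiredChild` splits unconditionally.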
Answer 1 (score: 10)
Here is my suggestion. It requires a streaming XSLT 3.0 processor, which in practice means it needs Saxon-EE 9.3.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:mode streamable="yes"/>
<xsl:template match="/">
<xsl:apply-templates select="company/staff"/>
</xsl:template>
<xsl:template match="staff">
<xsl:variable name="v" as="element(staff)">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:if test="$v/nickname">
<xsl:result-document href="{@id}.xml">
<xsl:copy-of select="$v"/>
</xsl:result-document>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
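Actually, unless you have hundreds of megabytes of data, I suspect a non-streaming solution will be quite fast enough, and probably faster than your hand-written Java code, given that your Java code is nothing to get excited about. At any rate, give an XSLT solution a try before you write reams of low-level Java. It's a routine problem, after all.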
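To get a feel for the non-streaming route without Saxon, here is a minimal sketch that runs an XSLT 1.0 stylesheet through the JDK's built-in JAXP `TransformerFactory`. Note the limitation: XSLT 1.0 has no `xsl:result-document`, so this sketch only filters out staff elements without a nickname into a single result. It does not split into multiple files. The class name `XsltFilterSketch` and the `filter` method are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltFilterSketch {

    // Identity transform plus an empty template that drops staff without a nickname.
    static final String STYLESHEET =
          "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"staff[not(nickname)]\"/>"
        + "</xsl:stylesheet>";

    /** Returns the input document with every staff element lacking a nickname removed. */
    static String filter(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

The whole input is read into memory here, which is exactly the trade-off mentioned above: fine for moderately sized files, not for very large ones.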
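Answer 2 (score: 6)
You could do the following with StAX:
Algorithm
Code for Your Use Case
The following code uses the StAX APIs to break up the document as outlined in your question: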
package forum7408938;
import java.io.*;
import java.util.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;
public class Demo {
public static void main(String[] args) throws Exception {
Demo demo = new Demo();
demo.split("src/forum7408938/input.xml", "nickname");
//demo.split("src/forum7408938/input.xml", null);
}
private void split(String xmlResource, String condition) throws Exception {
XMLEventFactory xef = XMLEventFactory.newFactory();
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLEventReader xer = xif.createXMLEventReader(new FileReader(xmlResource));
StartElement rootStartElement = xer.nextTag().asStartElement(); // Advance to the root element
StartDocument startDocument = xef.createStartDocument();
EndDocument endDocument = xef.createEndDocument();
XMLOutputFactory xof = XMLOutputFactory.newFactory();
while(xer.hasNext() && !xer.peek().isEndDocument()) {
boolean metCondition;
XMLEvent xmlEvent = xer.nextTag();
if(!xmlEvent.isStartElement()) {
break;
}
// BOUNTY CRITERIA
// Be able to split XML file into n parts with x split elements(from
// the dummy XML example staff is the split element).
StartElement breakStartElement = xmlEvent.asStartElement();
List<XMLEvent> cachedXMLEvents = new ArrayList<XMLEvent>();
// BOUNTY CRITERIA
// I'd like to be able to specify condition that must be in the
// split element i.e. I want only staff which have nickname, I want
// to discard those without nicknames. But be able to also split
// without conditions while running split without conditions.
if(null == condition) {
cachedXMLEvents.add(breakStartElement);
metCondition = true;
} else {
cachedXMLEvents.add(breakStartElement);
xmlEvent = xer.nextEvent();
metCondition = false;
while(!(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
cachedXMLEvents.add(xmlEvent);
if(xmlEvent.isStartElement() && xmlEvent.asStartElement().getName().getLocalPart().equals(condition)) {
metCondition = true;
break;
}
xmlEvent = xer.nextEvent();
}
}
if(metCondition) {
// Create a file for the fragment, the name is derived from the value of the id attribute
FileWriter fileWriter = null;
fileWriter = new FileWriter("src/forum7408938/" + breakStartElement.getAttributeByName(new QName("id")).getValue() + ".xml");
// A StAX XMLEventWriter will be used to write the XML fragment
XMLEventWriter xew = xof.createXMLEventWriter(fileWriter);
xew.add(startDocument);
// BOUNTY CRITERIA
// The content of the split files should be wrapped in the
// root element from the original file (like in the dummy example
// company)
xew.add(rootStartElement);
// Write the XMLEvents that were cached while we were
// checking the fragment to see if it matched our criteria.
for(XMLEvent cachedEvent : cachedXMLEvents) {
xew.add(cachedEvent);
}
// Write the XMLEvents that we still need to parse from this
// fragment
xmlEvent = xer.nextEvent();
while(xer.hasNext() && !(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
xew.add(xmlEvent);
xmlEvent = xer.nextEvent();
}
xew.add(xmlEvent);
// Close everything we opened
xew.add(xef.createEndElement(rootStartElement.getName(), null));
xew.add(endDocument);
xew.close(); // flush the event writer before closing the underlying file
fileWriter.close();
}
}
}
}
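Answer 3 (score: 3)
@Jon Skeet is right on target, as usual, with his advice. @Blaise Doughan gave you a very basic picture of using StAX (which would be my preferred choice, although you can do basically the same thing with SAX). You seem to be looking for something more explicit, so here is some pseudo code to get you started (based on StAX):
Edit:
Wow, I have to say I'm amazed at the number of people willing to do someone else's work for them. I didn't realize SO was basically a free version of rent-a-coder.
Answer 4 (score: 3)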
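@Gandalf StormCrow: Let me divide your problem into three separate issues:
i) Reading the XML and splitting it at the same time, in the best possible way
ii) Checking the condition in the split file
iii) If the condition is met, processing that split file.
For i), there are multiple solutions: SAX, StAX and other parsers, or, as you mentioned, something as simple as reading with plain Java IO operations and searching for the tags.
I believe any of SAX/StAX/plain Java IO will do. I have taken your example as the basis of my solution.
ii) Checking the condition in the split file: you have used the contains() method to check for the existence of nickname. This does not seem to be the best way: what if your conditions are more complex, e.g. nickname should be present but with length > 5, or salary should be numeric?
I would use the newer Java XML validation framework for this, which makes use of an XML schema. Note that we can cache the schema object in memory and reuse it again and again. This validation framework is pretty fast.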
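To make the "more complex condition" point concrete, here is a minimal sketch of schema-based checking with `javax.xml.validation`. It is not the schema from this answer: it is a hypothetical, stricter one that validates a single staff fragment, requires a nickname of at least 6 characters (i.e. length > 5), requires salary to be an integer, and uses `xs:all` so that child elements may appear in any order. The class name `FragmentCondition` and the `matches` method are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class FragmentCondition {

    // Hypothetical schema for one staff fragment (no target namespace, like the sample XML).
    private static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"staff\">"
        + "<xs:complexType>"
        + "<xs:all>"
        + "<xs:element name=\"firstname\" type=\"xs:string\" minOccurs=\"0\"/>"
        + "<xs:element name=\"lastname\" type=\"xs:string\" minOccurs=\"0\"/>"
        + "<xs:element name=\"nickname\">"
        + "<xs:simpleType><xs:restriction base=\"xs:string\">"
        + "<xs:minLength value=\"6\"/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:element>"
        + "<xs:element name=\"salary\" type=\"xs:int\" minOccurs=\"0\"/>"
        + "</xs:all>"
        + "<xs:attribute name=\"id\" type=\"xs:string\"/>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Compiled once and reused; Schema objects are safe to share between threads.
    private static final Schema SCHEMA = compile();

    private static Schema compile() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Returns true if the staff fragment satisfies the schema, false otherwise. */
    static boolean matches(String staffXml) throws IOException {
        try {
            // Validators are cheap to create but not thread-safe, so use one per call.
            SCHEMA.newValidator().validate(new StreamSource(new StringReader(staffXml)));
            return true;
        } catch (SAXException e) {
            return false; // validation error or malformed XML
        }
    }
}
```

With this schema, `<staff id="1"><salary>100000</salary><nickname>mkyong</nickname></staff>` is accepted, while a fragment with no nickname, a nickname shorter than 6 characters, or a non-numeric salary is rejected.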
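iii) If the condition is met, processing the split file: you may want to use the Java concurrency APIs to submit asynchronous tasks (the ExecutorService class), so that the work runs in parallel for better performance.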
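As a small illustration of point iii), here is a sketch that hands each accepted fragment to an `ExecutorService` and waits for all writes to finish before returning. The class name `AsyncFragmentWriter`, the `writeAll` method, the pool size of 4 and the `part_N.xml` file naming are assumptions made for this example only.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class AsyncFragmentWriter {

    /** Writes each fragment to its own file on a worker thread and returns the files written. */
    static List<Path> writeAll(List<String> fragments, Path outDir)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < fragments.size(); i++) {
                final Path target = outDir.resolve("part_" + i + ".xml");
                final String xml = fragments.get(i);
                // A Callable (not a Runnable) so that IOExceptions reach the caller via get().
                Callable<Path> task =
                        () -> Files.write(target, xml.getBytes(StandardCharsets.UTF_8));
                pending.add(pool.submit(task));
            }
            List<Path> written = new ArrayList<>();
            for (Future<Path> future : pending) {
                written.add(future.get()); // blocks until that file has been written
            }
            return written;
        } finally {
            pool.shutdown(); // without this, the pool's threads keep the JVM alive
        }
    }
}
```

In a real splitter the submitting loop would be the parser loop itself, so parsing continues while earlier fragments are still being written.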
So, considering the above points, one possible solution is as follows.
You can create a company.xsd file like:
<?xml version="1.0" encoding="UTF-8"?>
<!-- No targetNamespace: the elements in the sample document are in no namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="company">
<xs:complexType>
<xs:sequence>
<xs:element name="staff" type="stafftype"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="stafftype">
<xs:sequence>
<xs:element name="firstname" type="xs:string" minOccurs="0" />
<xs:element name="lastname" type="xs:string" minOccurs="0" />
<xs:element name="nickname" type="xs:string" minOccurs="1" />
<xs:element name="salary" type="xs:int" minOccurs="0" />
</xs:sequence>
<!-- The sample staff elements carry an id attribute, so it has to be declared -->
<xs:attribute name="id" type="xs:string" />
</xs:complexType>
</xs:schema>
Then your Java code would look like:
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;
public class testXML {
// Lookup a factory for the W3C XML Schema language
static SchemaFactory factory = SchemaFactory
.newInstance("http://www.w3.org/2001/XMLSchema");
// Compile the schema.
static File schemaLocation = new File("company.xsd");
static Schema schema = null;
static {
try {
schema = factory.newSchema(schemaLocation);
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private final ExecutorService pool = Executors.newFixedThreadPool(20);;
boolean validate(StringBuffer splitBuffer) {
boolean isValid = false;
Validator validator = schema.newValidator();
try {
validator.validate(new StreamSource(new ByteArrayInputStream(
splitBuffer.toString().getBytes())));
isValid = true;
} catch (SAXException ex) {
System.out.println(ex.getMessage());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return isValid;
}
void split(BufferedReader br, String rootElementName,
String splitElementName) {
StringBuffer splitBuffer = null;
String line = null;
String startRootElement = "<" + rootElementName + ">";
String endRootElement = "</" + rootElementName + ">";
// No closing '>' here, so start tags that carry attributes (e.g. <staff id="1">) also match
String startSplitElement = "<" + splitElementName;
String endSplitElement = "</" + splitElementName + ">";
String xmlDeclaration = "<?xml version=\"1.0\"";
boolean startFlag = false, endflag = false;
try {
while ((line = br.readLine()) != null) {
if (line.contains(xmlDeclaration)
|| line.contains(startRootElement)
|| line.contains(endRootElement)) {
continue;
}
if (line.contains(startSplitElement)) {
startFlag = true;
endflag = false;
splitBuffer = new StringBuffer(startRootElement);
splitBuffer.append(line);
} else if (line.contains(endSplitElement)) {
endflag = true;
startFlag = false;
splitBuffer.append(line);
splitBuffer.append(endRootElement);
} else if (startFlag) {
splitBuffer.append(line);
}
if (endflag) {
//process splitBuffer
boolean result = validate(splitBuffer);
if (result) {
//send it to a thread for processing further
//it is async so that main thread can continue for next
pool.submit(new ProcessingHandler(splitBuffer));
}
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
class ProcessingHandler implements Runnable {
String splitXML = null;
ProcessingHandler(StringBuffer splitXMLBuffer) {
this.splitXML = splitXMLBuffer.toString();
}
@Override
public void run() {
// do like writing to a file etc.
}
}
Answer 5 (score: 2)
Have a look at this. It is a slightly reworked sample from xmlpull.org:
http://www.xmlpull.org/v1/download/unpacked/doc/quick_intro.html
It should do everything you need, unless you have nested split tags like this:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
<staff id="1">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
<other>
<staff>
...
</staff>
</other>
</staff>
</company>
To run it in pass-through mode (no filtering), simply pass null as the required tag, as the main method does.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;
public class XppSample {
private String rootTag;
private String splitTag;
private String requiredTag;
private int flushThreshold;
private String fileName;
private String rootTagEnd;
private boolean hasRequiredTag = false;
private int flushCount = 0;
private int fileNo = 0;
private String header;
private XmlPullParser xpp;
private StringBuilder nodeBuf = new StringBuilder();
private StringBuilder fileBuf = new StringBuilder();
public XppSample(String fileName, String rootTag, String splitTag, String requiredTag, int flushThreshold) throws XmlPullParserException, FileNotFoundException {
this.rootTag = rootTag;
rootTagEnd = "</" + rootTag + ">";
this.splitTag = splitTag;
this.requiredTag = requiredTag;
this.flushThreshold = flushThreshold;
this.fileName = fileName;
XmlPullParserFactory factory = XmlPullParserFactory.newInstance(System.getProperty(XmlPullParserFactory.PROPERTY_NAME), null);
factory.setNamespaceAware(true);
xpp = factory.newPullParser();
xpp.setInput(new FileReader(fileName));
}
public void processDocument() throws XmlPullParserException, IOException {
int eventType = xpp.getEventType();
do {
if(eventType == XmlPullParser.START_TAG) {
processStartElement(xpp);
} else if(eventType == XmlPullParser.END_TAG) {
processEndElement(xpp);
} else if(eventType == XmlPullParser.TEXT) {
processText(xpp);
}
eventType = xpp.next();
} while (eventType != XmlPullParser.END_DOCUMENT);
saveFile();
}
public void processStartElement(XmlPullParser xpp) {
int holderForStartAndLength[] = new int[2];
String name = xpp.getName();
char ch[] = xpp.getTextCharacters(holderForStartAndLength);
int start = holderForStartAndLength[0];
int length = holderForStartAndLength[1];
if(name.equals(rootTag)) {
int pos = start + length;
header = new String(ch, 0, pos);
} else {
if(requiredTag==null || name.equals(requiredTag)) {
hasRequiredTag = true;
}
nodeBuf.append(xpp.getText());
}
}
public void flushBuffer() throws IOException {
if(hasRequiredTag) {
fileBuf.append(nodeBuf);
if(((++flushCount)%flushThreshold)==0) {
saveFile();
}
}
nodeBuf = new StringBuilder();
hasRequiredTag = false;
}
public void saveFile() throws IOException {
if(fileBuf.length()>0) {
String splitFile = header + fileBuf.toString() + rootTagEnd;
FileUtils.writeStringToFile(new File((fileNo++) + "_" + fileName), splitFile);
fileBuf = new StringBuilder();
}
}
public void processEndElement (XmlPullParser xpp) throws IOException {
String name = xpp.getName();
if(name.equals(rootTag)) {
flushBuffer();
} else {
nodeBuf.append(xpp.getText());
if(name.equals(splitTag)) {
flushBuffer();
}
}
}
public void processText (XmlPullParser xpp) throws XmlPullParserException {
int holderForStartAndLength[] = new int[2];
char ch[] = xpp.getTextCharacters(holderForStartAndLength);
int start = holderForStartAndLength[0];
int length = holderForStartAndLength[1];
String content = new String(ch, start, length);
nodeBuf.append(content);
}
public static void main (String args[]) throws XmlPullParserException, IOException {
//XppSample app = new XppSample("input.xml", "company", "staff", "nickname", 3);
XppSample app = new XppSample("input.xml", "company", "staff", null, 3);
app.processDocument();
}
}
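Answer 6 (score: 1)
Normally I would suggest using StAX, but it is unclear to me how 'stateful' your real XML is. If it is simple, use SAX for ultimate performance; if it is not so simple, use StAX. So you need to
Now, it might seem that steps 3-5 are the most resource-intensive, but I would rate them as
Most: 1 + 7
Medium: 2 + 6
Least: 3 + 4 + 5
As operations 1 and 7 are fairly separate from the rest, you should do them asynchronously; at the very least, creating multiple small files is best done in other threads, if you are familiar with multi-threading. For increased performance, you might also look into the new IO facilities in Java.
Now for steps 2 + 3 and 5 + 6 you can go a long way with FasterXML. It really does a lot of the things you are looking for, like triggering JVM hot-spot attention in the right places; judging from a quick look through the code, it might even support asynchronous reading/writing.
So then we are left with step 5, and depending on your logic, you should either
a. do object binding, then decide what to do, or
b. write the XML anyway, hoping for the best, and then throw it away if no 'staff' element is present.
Whatever you do, object reuse is sensible. Note that both alternatives (obviously) require the same amount of parsing (skip out of the subtree as soon as possible), and for alternative b, a little extra XML is actually not that bad performance-wise; ideally, make sure your char buffers are larger than one unit.
Alternative b is the easiest to implement: simply copy each 'XML event' from your reader to the writer, for example with StAX: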
private static void copyEvent(int event, XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
if (event == XMLStreamConstants.START_ELEMENT) {
String localName = reader.getLocalName();
String namespace = reader.getNamespaceURI();
// TODO check this stuff again before setting in production
if (namespace != null) {
if (writer.getPrefix(namespace) != null) {
writer.writeStartElement(namespace, localName);
} else {
writer.writeStartElement(reader.getPrefix(), localName, namespace);
}
} else {
writer.writeStartElement(localName);
}
// first: namespace definition attributes
if(reader.getNamespaceCount() > 0) {
int namespaces = reader.getNamespaceCount();
for(int i = 0; i < namespaces; i++) {
String namespaceURI = reader.getNamespaceURI(i);
if(writer.getPrefix(namespaceURI) == null) {
String namespacePrefix = reader.getNamespacePrefix(i);
if(namespacePrefix == null) {
writer.writeDefaultNamespace(namespaceURI);
} else {
writer.writeNamespace(namespacePrefix, namespaceURI);
}
}
}
}
int attributes = reader.getAttributeCount();
// then write the rest of the attributes
for (int i = 0; i < attributes; i++) {
String attributeNamespace = reader.getAttributeNamespace(i);
if (attributeNamespace != null && attributeNamespace.length() != 0) {
writer.writeAttribute(attributeNamespace, reader.getAttributeLocalName(i), reader.getAttributeValue(i));
} else {
writer.writeAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
}
}
} else if (event == XMLStreamConstants.END_ELEMENT) {
writer.writeEndElement();
} else if (event == XMLStreamConstants.CDATA) {
String array = reader.getText();
writer.writeCData(array);
} else if (event == XMLStreamConstants.COMMENT) {
String array = reader.getText();
writer.writeComment(array);
} else if (event == XMLStreamConstants.CHARACTERS) {
String array = reader.getText();
if (array.length() > 0 && !reader.isWhiteSpace()) {
writer.writeCharacters(array);
}
} else if (event == XMLStreamConstants.START_DOCUMENT) {
writer.writeStartDocument();
} else if (event == XMLStreamConstants.END_DOCUMENT) {
writer.writeEndDocument();
}
}
And for a subtree:
private static void copySubTree(XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
reader.require(XMLStreamConstants.START_ELEMENT, null, null);
copyEvent(XMLStreamConstants.START_ELEMENT, reader, writer);
int level = 1;
do {
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
level++;
} else if(event == XMLStreamConstants.END_ELEMENT) {
level--;
}
copyEvent(event, reader, writer);
} while(level > 0);
}
You can deduce from that how to skip to a certain level. In general, for stateful StAX parsing, use the pattern
private static void parseSubTree(XMLStreamReader reader) throws XMLStreamException {
int level = 1;
do {
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
level++;
// do stateful stuff here
// for child logic:
if(reader.getLocalName().equals("Whatever")) {
parseSubTreeForWhatever(reader);
level --; // read from level 1 to 0 in submethod.
}
// alternatively, faster
if(level == 4) {
parseSubTreeForWhateverAtRelativeLevel4(reader);
level --; // read from level 1 to 0 in submethod.
}
} else if(event == XMLStreamConstants.END_ELEMENT) {
level--;
// do stateful stuff here, too
}
} while(level > 0);
}
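where, at the start of the document, you read until the first start element and break (add the writer and the copying for your use, of course, as above).
Note that if you do object binding, these methods should be placed in that object, and the same goes for the serialization methods.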
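The skipping mentioned above (leaving a subtree as soon as you know you don't want it) can be sketched with the same level-counting idea as copySubTree, just without the writer. This is an illustrative helper, not part of the original code; the class name `StaxSkip` and the method `skipSubTree` are hypothetical.

```java
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxSkip {

    /**
     * Expects the reader to be positioned on a START_ELEMENT and advances it
     * to the matching END_ELEMENT, discarding everything in between.
     */
    static void skipSubTree(XMLStreamReader reader) throws XMLStreamException {
        reader.require(XMLStreamConstants.START_ELEMENT, null, null);
        int level = 1;
        while (level > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                level++; // a nested element opens
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                level--; // reaches 0 on the end tag of the element we started on
            }
        }
    }
}
```

A caller would typically invoke it right after deciding that the current element is not wanted, then continue its own loop with the next call to `reader.next()`.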
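I am pretty sure you will get tens of MB/s on a modern system, and that should be sufficient. One issue to investigate further is how to use multiple cores for the actual input: if you know the encoding subset for a fact, like non-crazy UTF-8 or ISO-8859, then random access might be possible, which would let you send chunks to different cores.
Have fun, and tell us how it went ;)
Edit: I almost forgot: if for some reason you are the one creating the files in the first place, or you will be reading them after splitting, you will see huge performance gains from XML binarization; there are XML Schema generators whose output can in turn be fed into code generators. (Some XSLT transformation libraries use code generation too.) And run the JVM with the -server option.
Answer 7 (score: 0)
How I would make it faster:
Answer 8 (score: 0)
My suggestion is that SAX, StAX and DOM are not the ideal XML parsers for your problem; the perfect solution is called vtd-xml. There is an article on this subject explaining why DOM, SAX and StAX all do something very wrong. The code below is the shortest you have to write, yet it performs 10x faster than DOM or SAX. http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html
Here is a more recent paper entitled Processing XML with Java - A Performance Benchmark: http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf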
import com.ximpleware.*;
import java.io.*;
public class gandalf {
public static void main(String a[]) throws VTDException, Exception{
VTDGen vg = new VTDGen();
if (vg.parseFile("c:\\xml\\gandalf.txt", false)){
VTDNav vn=vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/company/staff[nickname]");
int i=-1;
int count=0;
while((i=ap.evalXPath())!=-1){
vn.dumpFragment("c:\\xml\\staff"+count+".xml");
count++;
}
}
}
}
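Answer 9 (score: -1)
Here is a DOM-based solution. I have tested it with the XML you provided. It still needs to be checked against the actual XML file that you have.
Since this is based on a DOM parser, please remember that it will require a lot of memory, depending on the size of your XML file. But it is much faster because it is DOM-based.
Algorithm:
This can be run from the command prompt as follows: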
java XMLSplitter xmlFileLocation splitElement filter filterElement
For the XML you mentioned, it would be:
java XMLSplitter input.xml staff true nickname
If you don't want to filter:
java XMLSplitter input.xml staff
Here is the complete Java code:
package com.xml.xpath;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class XMLSplitter {
DocumentBuilder builder = null;
XPath xpath = null;
Transformer transformer = null;
String filterElement;
String splitElement;
String xmlFileLocation;
boolean filter = true;
public static void main(String[] arg) throws Exception{
XMLSplitter xMLSplitter = null;
if(arg.length < 4){
if(arg.length < 2){
System.out.println("Insufficient arguments !!!");
System.out.println("Usage: XMLSplitter xmlFileLocation splitElement filter filterElement ");
return;
}else{
System.out.println("Filter is off...");
xMLSplitter = new XMLSplitter();
xMLSplitter.init(arg[0],arg[1],false,null);
}
}else{
xMLSplitter = new XMLSplitter();
xMLSplitter.init(arg[0],arg[1],Boolean.parseBoolean(arg[2]),arg[3]);
}
xMLSplitter.start();
}
public void init(String xmlFileLocation, String splitElement, boolean filter, String filterElement )
throws ParserConfigurationException, TransformerConfigurationException{
//Initialize the Document builder
System.out.println("Initializing..");
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
builder = domFactory.newDocumentBuilder();
//Initialize the transformer
TransformerFactory transformerFactory = TransformerFactory.newInstance();
transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.ENCODING,"UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
//Initialize the xpath
XPathFactory factory = XPathFactory.newInstance();
xpath = factory.newXPath();
this.filterElement = filterElement;
this.splitElement = splitElement;
this.xmlFileLocation = xmlFileLocation;
this.filter = filter;
}
public void start() throws Exception{
//Parser the file
System.out.println("Parsing file.");
Document doc = builder. parse(xmlFileLocation);
//Get the root node name
System.out.println("Getting root element.");
XPathExpression rootElementexpr = xpath.compile("/");
Object rootExprResult = rootElementexpr.evaluate(doc, XPathConstants.NODESET);
NodeList rootNode = (NodeList) rootExprResult;
String rootNodeName = rootNode.item(0).getFirstChild().getNodeName();
//Get the list of split elements
XPathExpression expr = xpath.compile("//"+splitElement);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println("Total number of split nodes "+nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
//Wrap each node inside root of the parent xml doc
Node sigleNode = wrappInRootElement(rootNodeName,nodes.item(i));
//Get the XML string of the fragment
String xmlFragment = serializeDocument(sigleNode);
//System.out.println(xmlFragment);
//Write the xml fragment in file.
storeInFile(xmlFragment,i);
}
}
private Node wrappInRootElement(String rootNodeName, Node fragmentDoc)
throws XPathExpressionException, ParserConfigurationException, DOMException,
SAXException, IOException, TransformerException{
//Create empty doc with just root node
DOMImplementation domImplementation = builder.getDOMImplementation();
Document doc = domImplementation.createDocument(null,null,null);
Element theDoc = doc.createElement(rootNodeName);
doc.appendChild(theDoc);
//Insert the fragment inside the root node
InputSource inStream = new InputSource();
String xmlString = serializeDocument(fragmentDoc);
inStream.setCharacterStream(new StringReader(xmlString));
Document fr = builder.parse(inStream);
theDoc.appendChild(doc.importNode(fr.getFirstChild(),true));
return doc;
}
private String serializeDocument(Node doc) throws TransformerException, XPathExpressionException{
if(!serializeThisNode(doc)){
return null;
}
DOMSource domSource = new DOMSource(doc);
StringWriter stringWriter = new StringWriter();
StreamResult streamResult = new StreamResult(stringWriter);
transformer.transform(domSource, streamResult);
String xml = stringWriter.toString();
return xml;
}
//Check whether node is to be stored in file or rejected based on input
private boolean serializeThisNode(Node doc) throws XPathExpressionException{
if(!filter){
return true;
}
XPathExpression filterElementexpr = xpath.compile("//"+filterElement);
Object result = filterElementexpr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
if(nodes.item(0) != null){
return true;
}else{
return false;
}
}
private void storeInFile(String content, int fileIndex) throws IOException{
if(content == null || content.length() == 0){
return;
}
String fileName = splitElement+fileIndex+".xml";
File file = new File(fileName);
if(file.exists()){
System.out.println(" The file "+fileName+" already exists !! cannot create the file with the same name ");
return;
}
FileWriter fileWriter = new FileWriter(file);
fileWriter.write(content);
fileWriter.close();
System.out.println("Generated file "+fileName);
}
}
Please let me know whether this works for you, or if you need any other help with this code.