Question

I am using javax.xml.stream.XMLInputFactory to walk through a 13 GB Wikipedia XML file.

Now I want to know the byte position at which each <page> tag starts, so that I can jump to it later and read it.

Here is some code:

inputStream = new FileInputStream(xmlFile); // I am free to change this

XMLInputFactory inputFactory = XMLInputFactory.newInstance(); // maybe there is a better way?
eventReader = inputFactory.createXMLEventReader(inputStream);


// this is in a loop
event = eventReader.nextEvent();

if (event.isStartElement()) {
    StartElement startElement = event.asStartElement();

    if ("page".equals(startElement.getName().getLocalPart())) { // compare strings with equals(), not ==
         // !!! here I want to know the byte position in the file
    }
}

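What I tried:

inputStream.getChannel().position()

and

inputStream.getChannel().position(...)

to jump to the tag's position and read the tag. But this does not work, because the eventReader reads the file in chunks of roughly 8000 bytes, so the channel position is already past the event currently being processed.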

Answer 1

You need to add the encoding:

eventReader = inputFactory.createXMLEventReader(inputStream, "ASCII");

Answer 2

To find out where in the XML stream an element comes from, call getLocation().
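A minimal sketch of this suggestion, assuming the JDK's built-in StAX parser. It uses the cursor API (XMLStreamReader) instead of the event reader from the question, and the names PageOffsets and pageOffsets are made up for illustration. Two caveats apply: Location.getCharacterOffset() returns an int, so it cannot represent positions beyond roughly 2 GB, and whether the value counts bytes or characters, and whether it points at the start or the end of the tag, may depend on the parser implementation. Treat the numbers as candidates to verify against the file, not as exact byte positions.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PageOffsets {

    /**
     * Collects the offset the parser reports at every <page> start element.
     * The value comes from Location.getCharacterOffset(), which is an int
     * and is -1 when the parser cannot provide an offset.
     */
    static List<Integer> pageOffsets(InputStream in) {
        List<Integer> offsets = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    // The location is only valid until the next call to next().
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("XML parsing failed", e);
        }
        return offsets;
    }
}
```

With the event reader from the question, the same information should be reachable through event.getLocation() on each XMLEvent.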

You cannot use it to magically make the XML parser start in the middle of the file, though; an XML file has to be read sequentially.
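The parser cannot resume mid-document, but once a byte offset is known, the raw bytes of a single element can still be pulled out without a parser. The following is a rough sketch of that idea, not something taken from the answer: PageSlice and readPage are hypothetical names, the file is assumed to be UTF-8, and the text "</page>" is assumed not to occur inside comments or CDATA sections.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class PageSlice {

    private static final byte[] END_TAG =
            "</page>".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the text from byteOffset through the first "</page>",
     * or null if no closing tag follows.
     */
    static String readPage(String path, long byteOffset) {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(byteOffset); // long offset, so files over 2 GB are fine
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int matched = 0; // number of END_TAG bytes matched so far
            int n;
            while ((n = file.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    byte b = buffer[i];
                    out.write(b);
                    if (b == END_TAG[matched]) {
                        matched++;
                        if (matched == END_TAG.length) {
                            return new String(out.toByteArray(), StandardCharsets.UTF_8);
                        }
                    } else {
                        // '<' occurs only at the start of the pattern, so a
                        // failed match can restart only on the current byte.
                        matched = (b == END_TAG[0]) ? 1 : 0;
                    }
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string is a self-contained <page>...</page> fragment, which could then be handed to a fresh parser as its own small document.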

java - Reading a large XML file and getting the byte position of an element
