I have about 25,000 XML files that I need to read with Java. This is my code:
private static void ProcessFile() {
    try {
        File fXmlFile = new File("C:/Users/Emolk/Desktop/000010.xml");
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(fXmlFile);
        doc.getDocumentElement().normalize();
        System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
        NodeList nList = doc.getElementsByTagName("sindex");
        System.out.println("----------------------------");
        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            System.out.println("");
            if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                Element eElement = (Element) nNode;
                System.out.println("Name : " + eElement.getElementsByTagName("name").item(0).getTextContent());
                System.out.println("Count : " + eElement.getElementsByTagName("count").item(0).getTextContent());
                Entity CE = new Entity(eElement.getElementsByTagName("name").item(0).getTextContent(), Integer.parseInt(eElement.getElementsByTagName("count").item(0).getTextContent()));
                Entities.add(CE);
                System.out.println("Entity added! ");
            }
        }
        System.out.println(Entities);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
How can I read all 25,000 files instead of just one?
I tried joining all the XML files into one with this tool: https://www.sobolsoft.com/howtouse/combine-xml-files.htm
But that gave me this error:
[Fatal Error] joined.xml:130:2: The markup in the document following the
root element must be well-formed.
Answer 0 (score: 0)
If performance is not a concern, you can do something like this:
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ReadFiles {
    public static void main(String[] args) {
        // Entity is your own class from the question; a List is assumed here,
        // but any other collection would work as well
        List<Entity> Entities = new ArrayList<>();
        File dir = new File("D:/Work"); // directory where your files are
        File[] files = dir.listFiles(); // null if dir is not a readable directory
        if (files == null) {
            System.err.println("Cannot list " + dir);
            return;
        }
        for (File file : files) {
            // only regular files whose name ends with .xml
            if (file.isFile() && file.getName().endsWith(".xml")) {
                ProcessFile(file, Entities);
            }
        }
        System.out.println(Entities);
    }

    private static void ProcessFile(File fXmlFile, List<Entity> Entities) {
        try {
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(fXmlFile);
            doc.getDocumentElement().normalize();
            System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
            NodeList nList = doc.getElementsByTagName("sindex");
            System.out.println("----------------------------");
            for (int temp = 0; temp < nList.getLength(); temp++) {
                Node nNode = nList.item(temp);
                System.out.println("");
                if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element eElement = (Element) nNode;
                    System.out.println("Name : " + eElement.getElementsByTagName("name").item(0).getTextContent());
                    System.out.println("Count : " + eElement.getElementsByTagName("count").item(0).getTextContent());
                    Entity CE = new Entity(eElement.getElementsByTagName("name").item(0).getTextContent(), Integer.parseInt(eElement.getElementsByTagName("count").item(0).getTextContent()));
                    Entities.add(CE);
                    System.out.println("Entity added! ");
                }
            }
        } catch (Exception e) {
            // a failure in one file is printed and does not stop the loop in main
            e.printStackTrace();
        }
    }
}
Answer 1 (score: 0)
To read multiple files, you need some kind of loop to iterate over them. You can scan a directory for all valid files:
File folder = new File("path/to/directory");
File[] files = folder.listFiles();
for (int i = 0; i < files.length; i++) {
    // you can also filter for .xml if needed
    if (files[i].isFile()) {
        // parse the file
    }
}
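The `.xml` filter mentioned in the comment above could look roughly like the sketch below. The class `XmlFileLister` and its two helper methods are names made up for this illustration, and matching the extension case-insensitively is an assumption about what you want. It also guards against `listFiles` returning `null`, which happens when the path is not a readable directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original answer.
class XmlFileLister {

    // True if the file name ends with ".xml", ignoring case (assumption).
    static boolean hasXmlExtension(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".xml");
    }

    // Returns the regular files in 'folder' whose names end with ".xml".
    static List<File> listXmlFiles(File folder) {
        List<File> result = new ArrayList<>();
        File[] files = folder.listFiles((dir, name) -> hasXmlExtension(name));
        if (files == null) {
            return result; // not a directory, or it could not be read
        }
        for (File file : files) {
            if (file.isFile()) {
                result.add(file);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        File folder = new File(args.length > 0 ? args[0] : ".");
        for (File file : listXmlFiles(folder)) {
            System.out.println(file.getPath());
        }
    }
}
```

The returned list can then be passed to whichever loop you use to parse the files.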
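Next, you need to decide how to parse the files: sequentially or in parallel. Parallel processing is usually faster, because several threads parse files at the same time.
For sequential processing, you can reuse the code you have already written and iterate over the files:
for (File file : files) {
    processFile(file, yourListOfEntities);
}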
For parallel processing, get an ExecutorService and submit multiple tasks:
ExecutorService service = Executors.newFixedThreadPool(5);
for (File file : files) {
    service.execute(() -> processFile(file, yourListOfEntities));
}
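An important note here: the default implementation, ArrayList, is not thread-safe, so because the List is used by multiple threads you should synchronize access to it: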
List<Entity> synchronizedList = Collections.synchronizedList(yourListOfEntities);
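One detail the snippets leave open is how to know when all submitted tasks have finished, for example before printing the list. The sketch below shows one possible way, using `shutdown` and `awaitTermination` together with a synchronized list. `ParallelRunner`, `forEachParallel`, and the one-hour timeout are assumptions made for illustration, and the file names are placeholders standing in for real parsing work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper, not part of the original answer.
class ParallelRunner {

    // Runs 'task' once per item on a fixed-size pool and blocks until all tasks are done.
    static <T> void forEachParallel(List<T> items, int threads, Consumer<T> task) {
        ExecutorService service = Executors.newFixedThreadPool(threads);
        for (T item : items) {
            service.execute(() -> task.accept(item));
        }
        service.shutdown(); // accept no new tasks; submitted ones still run
        try {
            // the one-hour limit is an arbitrary assumption
            if (!service.awaitTermination(1, TimeUnit.HOURS)) {
                service.shutdownNow();
            }
        } catch (InterruptedException e) {
            service.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // placeholder names; in the real program these would be File objects
        List<String> names = Arrays.asList("000010.xml", "000011.xml", "000012.xml");
        // synchronized wrapper, so concurrent add() calls are safe
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        forEachParallel(names, 5, name -> processed.add(name));
        System.out.println(processed.size() + " files processed");
    }
}
```

With tasks running in parallel, the order of elements in the list is not guaranteed to match the order of the files.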
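In addition, DocumentBuilder is not thread-safe either and should be created once per thread. If you simply call your method as it is, you are fine, because it creates a new builder on every call. This note only matters if you decide to optimize that part.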
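If you do decide to optimize the builder creation, one possible approach is to keep one DocumentBuilder per thread in a `ThreadLocal`. This is only a sketch: `PerThreadBuilder`, `rootName`, and the sample XML with its `sindexes` root element are invented for the example, and whether the saving is noticeable for your files is something you would have to measure.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original answer.
class PerThreadBuilder {

    // Each thread lazily creates its own builder and then reuses it.
    private static final ThreadLocal<DocumentBuilder> BUILDER = ThreadLocal.withInitial(() -> {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    });

    // Parses the XML text with the current thread's builder and returns the root element name.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder = BUILDER.get();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // made-up sample document; the real files may have a different root element
        String xml = "<sindexes><sindex><name>a</name><count>3</count></sindex></sindexes>";
        System.out.println(rootName(xml));
    }
}
```

In the parsing method, `BUILDER.get()` would take the place of the two lines that create the factory and the builder.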