How do I split a large XML file?

Asked: 2010-12-01 15:17:52

Tags: xml windows

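We export "records" to an XML file; one of our customers has complained that the file is too large for their other system to process. So I need to split the file, repeating the "header section" in each of the new files.

So I am looking for something that lets me define some XPaths for the sections that should always be output, and another XPath for the "rows", with parameters that say how many rows to put in each file and how to name the files.

Before I start writing custom .NET code: is there a standard command-line tool that runs on Windows and does this?

(Since I know how to program in C#, I am more inclined to write code than to mess about with complex XSL and the like, but an off-the-shelf solution would be better than custom code.)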

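7 answers:

Answer 0 (score: 3)

There is no general-purpose solution, because the source XML could be structured in so many different ways.

It is reasonably straightforward to build an XSLT transform that outputs a slice of an XML document. For example, given this XML: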

<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>

you can use this XSLT to output a copy of the file that contains only the data elements within a given range:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="startPosition"/>
  <xsl:param name="endPosition"/>

  <xsl:template match="@* | node()">
      <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
      </xsl:copy> 
  </xsl:template>

  <xsl:template match="header">
    <xsl:copy>
      <xsl:apply-templates select="data"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

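(Note, by the way, that because this is based on the identity transform, it works even if header is not the top-level element.)

You still need to count the data elements in the source XML, and run the transform repeatedly with the $startPosition and $endPosition values appropriate for each output file.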
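
As a rough sketch of that driver step, the following Python counts the data elements and computes the startPosition/endPosition pair for each run. Python is an assumption here (the question mentions C#, and any scripting language would do), and the names count_records and position_ranges are hypothetical. Each pair would then be passed as parameters to whichever XSLT processor you use; that call is not shown.

```python
import xml.etree.ElementTree as ET


def count_records(xml_text, tag="data"):
    """Count every element named `tag` in the document."""
    return sum(1 for _ in ET.fromstring(xml_text).iter(tag))


def position_ranges(total_records, per_file):
    """Yield inclusive 1-based (start, end) pairs covering all records."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    for start in range(1, total_records + 1, per_file):
        # The last batch may hold fewer than per_file records.
        yield start, min(start + per_file - 1, total_records)


if __name__ == "__main__":
    sample = "<header>" + "".join(
        f'<data rec="{i}"/>' for i in range(1, 7)
    ) + "</header>"
    total = count_records(sample)
    # Two records per output file is an arbitrary choice for illustration.
    for start, end in position_ranges(total, 2):
        print(f"startPosition={start} endPosition={end}")
```

The positions are 1-based and inclusive because that is how the position() test in the stylesheet compares them.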

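Answer 1 (score: 3)

First, download the foxe XML editor from this link: http://www.firstobject.com/foxe242.zip

Then watch this video: http://www.firstobject.com/xml-splitter-script-video.htm. It explains how the splitter code works.

That page has a script (it starts with split()). Copy the code, then in the XML editor create a "New Program" under the File menu. Paste the code and save it. The code is: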

split()
{
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "**50MB.xml**", MDF_READFILE );
  int nObjectCount = 0, nFileCount = 0;
  while ( xmlInput.FindElem("//**ACT**") )
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "**piece**" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( "**root**" );
      xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == **5** )
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}

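Change the fields shown in bold (or marked with ** **) to suit your needs. (This is also explained on the video page.)

Right-click in the XML editor window and click RUN (or simply press F9). An output bar in the window shows the number of files generated.

Note: the input file name can be a full path such as "C:\\Users\\AUser\\Desktop\\a_xml_file.xml" (with doubled backslashes), and the output file can be "C:\\Users\\AUser\\Desktop\\anoutputfolder\\piece" + nFileCount + ".xml".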
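
For readers without foxe, the batching logic of the split() script above can be sketched in Python with the standard library's streaming parser. This is an illustration under assumptions, not a port of the CMarkup API: the function name split_stream and its parameters are hypothetical, and records are matched by tag name only.

```python
import xml.etree.ElementTree as ET


def split_stream(source, record_tag, per_file, root_tag="root", prefix="piece"):
    """Write per_file records to each output file; return the file count."""
    file_count = 0
    batch = []

    def flush():
        nonlocal file_count
        file_count += 1
        out_root = ET.Element(root_tag)
        out_root.extend(batch)
        ET.ElementTree(out_root).write(
            f"{prefix}{file_count}.xml", encoding="utf-8", xml_declaration=True
        )
        # Drop the written records' contents to keep memory use low.
        for elem in batch:
            elem.clear()
        batch.clear()

    # The "end" event fires once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            batch.append(elem)
            if len(batch) == per_file:
                flush()
    if batch:  # write the final, possibly shorter, batch
        flush()
    return file_count
```

Calling split_stream("50MB.xml", "ACT", 5) would mirror the script's settings and write piece1.xml, piece2.xml and so on to the current directory. Like the script, it wraps each batch in a new root element and does not copy any header section.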

Answer 2 (score: 2)

xml_split - split huge XML documents into smaller chunks

http://www.perlmonks.org/index.pl?node_id=429707

http://metacpan.org/pod/XML::Twig

Answer 3 (score: 2)

As mentioned above, xml_split from the Perl package XML::Twig does a great job.

Usage

xml_split < bigFile.xml

#or if compressed e.g.
bzcat bigFile.xml.bz2 | xml_split

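With no arguments, xml_split creates one file per top-level child node.

There are parameters to specify the number of elements you want per file (-g) or the approximate size (-s <Kb|Mb|Gb>).

Installation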

Look here

Linux

sudo apt-get install xml-twig-tools

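Answer 4 (score: 1)

Nothing built in handles this situation easily.

Your approach sounds reasonable, though I would probably start with a "skeleton" document containing the elements that need to be repeated, and generate the multiple documents from it with the "records".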
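
A minimal sketch of that skeleton idea in Python, assuming the records are direct children of the root element and that the whole file fits in memory. The function name split_with_skeleton and the sample element names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def split_with_skeleton(root, record_tag, per_file):
    """Return new root elements, each holding the repeated non-record
    children followed by at most per_file records."""
    records = [child for child in root if child.tag == record_tag]

    # The skeleton is the root minus its record children.
    skeleton = ET.Element(root.tag, root.attrib)
    skeleton.text = root.text
    for child in root:
        if child.tag != record_tag:
            skeleton.append(copy.deepcopy(child))

    pieces = []
    for start in range(0, len(records), per_file):
        piece = copy.deepcopy(skeleton)
        piece.extend([copy.deepcopy(r) for r in records[start:start + per_file]])
        pieces.append(piece)
    return pieces


if __name__ == "__main__":
    source = ET.fromstring(
        "<export><title>Report</title>"
        "<row id='1'/><row id='2'/><row id='3'/></export>"
    )
    for piece in split_with_skeleton(source, "row", 2):
        print(ET.tostring(piece, encoding="unicode"))
```

Note that this places all repeated elements before the records, so any element that followed the records in the source moves ahead of them in the output.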


Update

After a bit of digging, I found this article, which describes a way of splitting files using XSLT.

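Answer 5 (score: 0)

Using UltraEdit, based on https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704

All I added were some XML header and footer bits. The first and last files need to be fixed manually (or remove the root element from the source).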

// from https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704

var FoundsPerFile = 200;      // Global setting for number of found split strings per file.
var SplitString = "</letter>";  // String where to split. The split occurs after next character.
var xmlHead = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
var xmlRootStart = '<letters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" letterCode="OA01" >';
var xmlRootEnd = '</letters>';

/* Find the tab index of the active document */
// Copied from http://www.ultraedit.com/forums/viewtopic.php?t=4571
function getActiveDocumentIndex () {
   var tabindex = -1; /* start value */

   for (var i = 0; i < UltraEdit.document.length; i++)
   {
      if (UltraEdit.activeDocument.path==UltraEdit.document[i].path) {
         tabindex = i;
         break;
      }
   }
   return tabindex;
}

if (UltraEdit.document.length) { // Is any file open?
   // Set working environment required for this job.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   UltraEdit.ueReOn();

   // Move cursor to top of active file and run the initial search.
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=false;
   // If the string to split is not found in this file, do nothing.
   if (UltraEdit.activeDocument.findReplace.find(SplitString)) {
      // This file is probably the correct file for this script.
      var FileNumber = 1;    // Counts the number of saved files.
      var StringsFound = 1;  // Counts the number of found split strings.
      var NewFileIndex = UltraEdit.document.length;
      /* Get the path of the current file to save the new
         files in the same directory as the current file. */
      var SavePath = "";
      var LastBackSlash = UltraEdit.activeDocument.path.lastIndexOf("\\");
      if (LastBackSlash >= 0) {
         LastBackSlash++;
         SavePath = UltraEdit.activeDocument.path.substring(0,LastBackSlash);
      }
      /* Get active file index in case of more than 1 file is open and the
         current file does not get back the focus after closing the new files. */
      var FileToSplit = getActiveDocumentIndex();
      // Always use clipboard 9 for this script and not the Windows clipboard.
      UltraEdit.selectClipboard(9);
      // Split the file after every x found split strings until source file is empty.
      while (1) {
         while (StringsFound < FoundsPerFile) {
            if (UltraEdit.document[FileToSplit].findReplace.find(SplitString)) StringsFound++;
            else {
               UltraEdit.document[FileToSplit].bottom();
               break;
            }
         }
         // End the selection of the find command.
         UltraEdit.document[FileToSplit].endSelect();
         // Move the cursor right to include the next character and unselect the found string.
         UltraEdit.document[FileToSplit].key("RIGHT ARROW");
         // Select from this cursor position everything to top of the file.
         UltraEdit.document[FileToSplit].selectToTop();
         // Is the file not already empty?
         if (UltraEdit.document[FileToSplit].isSel()) {
            // Cut the selection and paste it into a new file.
            UltraEdit.document[FileToSplit].cut();
            UltraEdit.newFile();
            UltraEdit.document[NewFileIndex].setActive();
            UltraEdit.activeDocument.paste();


            /* Add line termination on the last line and remove automatically added indent
               spaces/tabs if auto-indent is enabled if the last line is not already terminated. */
            if (UltraEdit.activeDocument.isColNumGt(1)) {
               UltraEdit.activeDocument.insertLine();
               if (UltraEdit.activeDocument.isColNumGt(1)) {
                  UltraEdit.activeDocument.deleteToStartOfLine();
               }
            }

            // add headers and footers 

            UltraEdit.activeDocument.top();
            UltraEdit.activeDocument.write(xmlHead);
            UltraEdit.activeDocument.write(xmlRootStart);
            UltraEdit.activeDocument.bottom();
            UltraEdit.activeDocument.write(xmlRootEnd);
            // Build the file name for this new file.
            var SaveFileName = SavePath + "LETTER";
            if (FileNumber < 10) SaveFileName += "0";
            SaveFileName += String(FileNumber) + ".raw.xml";
            // Save the new file and close it.
            UltraEdit.saveAs(SaveFileName);
            UltraEdit.closeFile(SaveFileName,2);
            FileNumber++;
            StringsFound = 0;
            /* Delete the line termination in the source file
               if last found split string was at end of a line. */
            UltraEdit.document[FileToSplit].endSelect();
            UltraEdit.document[FileToSplit].key("END");
            if (UltraEdit.document[FileToSplit].isColNumGt(1)) {
               UltraEdit.document[FileToSplit].top();
            } else {
               UltraEdit.document[FileToSplit].deleteLine();
            }
         } else break;
         UltraEdit.outputWindow.write("Progress " + SaveFileName);
      }  // Loop executed until source file is empty!

      // Close source file without saving and re-open it.
      var NameOfFileToSplit = UltraEdit.document[FileToSplit].path;
      UltraEdit.closeFile(NameOfFileToSplit,2);
      /* The following code line could be commented if the source
         file is not needed anymore for further actions. */
      UltraEdit.open(NameOfFileToSplit);

      // Free memory and switch back to Windows clipboard.
      UltraEdit.clearClipboard();
      UltraEdit.selectClipboard(0);
   }
}

Answer 6 (score: -2)

"Is there a standard command-line tool that runs on Windows and does this?"

Yes. http://xponentsoftware.com/xmlSplit.aspx