Parsing large XML files to add new lines between elements is really slow

Asked: 2011-03-13 17:30:55

Tags: c# .net xml sql-server-2008 string

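I have a scenario where I need to pull data out of a database and write it out as XML. The catch is that the users want each element (DB column) separated by a new line. The table I am extracting from has about 20,000 rows and a lot of ntext columns (the table is about 3 GB).

I am splitting the output into files of 250 rows each, roughly 14 MB per file. The problem is that the parsing is really slow. To add a new line between each element/column, I put a unique string between each column in the DB query so that I can call Regex.Split and append a new line to each item in the resulting array.

I am sure this is user error/ignorance on my part, since I mostly live in the database, but I am not sure what to try to speed up the parsing. Pulling the data out of the database as XML is very fast, and writing it out is fairly fast. However, introducing the parsing and adding a new line between each element makes each file take about 3 minutes to write.

Any suggestions on what I should use in C# to parse the XML and add the line breaks would be greatly appreciated.

As always, I appreciate the input/comments from Stack.

The code I am using to parse the XML data: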

    //parsing the xml anywhere I see the string AddNewLine
    public static void WriteFile(string xml,int fileNum)
    {
        string[] xmlArray = Regex.Split(xml, "AddNewLine");
        string newXml = "";

        //Getting filepath to write file out to
        Connection filePath = new Connection();
        string fileName = filePath.FilePath;

        //foreach item in the array append carriage and new line
        foreach(string xmlRow in xmlArray)
        {
            newXml = newXml + xmlRow + "\n\r\n";
        }

        //use StreamWriter to write file
        using (StreamWriter sw = new StreamWriter(fileName + fileNum + ".xml"))
        {
            sw.Write(newXml);
        }

        //XmlDocument doc = new XmlDocument();
        //doc.LoadXml(newXml);
        //doc.Save(@"C:\TestFileWrite\PatentSchemaNew_" + fileNum + ".xml");   
    }

Sample XML output, where I want a new line between each element:

<products>
  <product>
    <ProductID>1</ProductID>
    <!--New Line-->
    <Product>TestProduct1</Product>
    <!--New Line-->
    <ProductDescription>With the introduction of the LE820 Series, Sharp once again establishes its leadership in LCD and LED technology. In a monumental engineering breakthrough, Sharp’s proprietary QuadPixel Technology, a 4-color filter that adds yellow to the traditional RGB, enables more than a trillion colors to be displayed for the first time. A stunning new contemporary edge-light design with full-front glass proudly announces a new AQUOS direction for 2010. The proprietary AQUOS LED system comprised of the X-Gen LCD panel and UltraBrilliant LEDs enables an incredible dynamic contrast ratio of 5,000,000:1 and picture quality that is second to none. The LE820 series is very fully featured, including the addition of Netflix™ streaming video capability through the AQUOS Net™ service, along with the industry’s leading online support system, AQUOS Advantage Live. A built in media player allows for playback of music and photos via USB port.

QuadPixel Technology 4-Color Filter adds yellow to the traditional RGB sub-pixel components, enabling the display of more than a trillion colors.

Full HD 1080p (1920 x 1080) Resolution for the sharpest picture possible.

UltraBrilliant LED System includes a “double-dome” light amplifier lens and multi-fluorescents, enabling high brightness and color purity.

Full HD 1080p X-Gen LCD Panel with 10-bit processing is designed with advanced pixel control to minimize light leakage and wider aperture to let more light through.

120Hz Fine Motion Advanced for fast-motion picture quality.

Wide Viewing Angles (176°H x 176°W) Sharp's AQUOS® LCD TVs’ viewing angles are so wide, you can view the TV clearly from practically anywhere in the room.

High Brightness (450 cd/m2) AQUOS LCD TVs are very bright. You can put them virtually anywhere – even near windows, doors or other light sources – and the picture is still vivid.

AQUOS Net delivers streaming video with Netflix™, customized Internet content and live customer support via Ethernet, viewable in widget, full-screen or split-screen mode.

USB Media Player adds the convenience of viewing high-resolution photos and music on the TV.</ProductDescription>
    <!--New Line-->
    <ProductAccessories> What You'll Need
Add 

Monster Cable MC BNDLF OL150F Bundle HDTV Performance Kit with Flat Panel Wall Bracket 
Monster Cable HT700 8 Outlet Surge Protector
Monster's SurgeGuard™ protects components from harmful surges and...
$208.95
 Get More Performance
Add 

AudioQuest AQ Kit4 1-4ft. and 1-8ft. Black HDTV Performance Pack with HDMI Cables, Screen Cleaner &amp; Mitt 
Uncompressed digital signal for the highest quality picture and sound. One cable for video, audio and control. Two-way communication for expanded system control. Automatic display and source matching for resolution, format and aspect ratio. Computer and gaming compatibility. $79.75
Recommended Accessories
 General Accessory
Add 

Monster Cable ScreenClean 6oz. Ultimate Performance TV Screen Cleaner 
Safe for use on your iPad, iPhone, iPod Touch, laptops, monitors, and TV screens Includes a high-tech reusable MicroFiber cloth that cleans screens without scratching Powerful cleaning solution removes dust, dirt, and oily fingerprints for ultimate clarity Advanced formula cleans without dripping, streaking, or staining like ordinary cleaners    $13.94
Add 
AudioQuest CleanScreen TV Screen Cleaning Kit 
$19.75
 Protection Plans
Add 

TechShield TTL200S5 5-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service) 
Parts and labor coverage with no deductibles No-lemon guarantee 50% value guarantee if you never use the warranty service   $314.95
Add 
TechShield TTL200S3 3-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service) 
$157.95
Add 
TechShield TTL200S4 4-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service) 
$262.95
Add 
TechShield TTL200S2 2-Year Service Warranty for LCD TVs $1,000-$2,000 (In-Home Service) 
$104.95
 Flat Panel Wall Mount - Fixed
Add 

OmniMount OL150F Flat Panel Wall Bracket 
Eco-friendly design and packaging Low mounting profile Includes universal rails and spacers for greater panel compatibility Small footprint provides ample room for power and A/V cutouts behind panel Lift n’ Lock™ allows you to easily attach your flat panel to the mount Sliding lateral on-wall adjustment Locking system secures panel to mount Installation template for simple and accurate mounting Includes end caps for a clean side view Includes complete hardware kit    $99.95
Add 
OmniMount NC200F Black Fixed Wall Mount for 37-63 inch Flat Panels 
$129.95
 Flat Panel Wall Mount - Tilt
Add 

OmniMount NC200T Black Tilt Mount for 37-63 inch Flat Panels 
Universal rails for greater panel compatibility Sliding lateral on-wall adjustment Locking bar works with padlock or screw End caps cover locking hardware and present a clean side view Installation template for simple and accurate mounting $179.95
 Flat Panel Wall Mount - Cantilever/Articulating
Add 

OmniMount UCL-X Platinum Wishbone Cantilever Mount Heavy Duty Dual Arm Double Stud 
Tilt, pan and swivel for maximum viewing flexibility Weight capacity: 200 lbs Double-arm i-beam design for added strength Integrated cable management hides wires Lift and lock mounting system $279.88
Add 
OmniMount NC125C Black Cantilever Mount for 37-52 inch Flat Panels 
$299.95
 Line Conditioner/Surge Protector
Add 

Panamax PM8-GAV Surge Protector with Current Sense Control 
8 Outlets (4 switched, 4 always on) Exclusive Protect or Disconnect circuitry Telephone line protection Cable and Satellite protection  $59.89
Add 
Monster Cable DL MDP 900 Monster Digital PowerCenter MDP 900 w/ Green Power and USB Charging 
$74.77
 HDMI Cable
Add 

AudioQuest HDMI-X 2m (6.56 ft) HDMI Digital Audio Video Cable with Braided Jacket 
Large 1.25% silver conductors Critical Twist Geometry Solid High-Density Polyethylene is used to minimize loss caused by insulation Uncompressed digital signal for the highest quality picture and sound   $40.00
Add 
Icarus ECB-HDM2 2m (6.56 ft) HDMI Digital Audio Video Cable 
$16.95
Add 
Monster Cable MC HDMIB 2m (6.56 ft.) HDMI Cable 
$39.00
 Component Video Cable
Add 

Monster Cable MC 400CV-2m (6.56 ft.) Advanced Performance Component Video Cable 
Get All the High Resolution Picture You Paid For 
Your new DVD player, cable/satellite receiver, and TV might be more advanced... $49.00
Add 
Monster Cable MC 400CV-1m (3.28 ft.) Advanced Performance Component Video Cable 
$39.00
Add 
AudioQuest YIQ-A 2m (6.6 ft) Component Video Cable 
$44.75
 General Accessory
Add 

Monster Cable ScreenClean 6oz. Ultimate Performance TV Screen Cleaner 
Safe for use on your iPad, iPhone, iPod Touch, laptops, monitors, and TV screens Includes a high-tech reusable MicroFiber cloth that cleans screens without scratching Powerful cleaning solution removes dust, dirt, and oily fingerprints for ultimate clarity Advanced formula cleans without dripping, streaking, or staining like ordinary cleaners    $13.94
Add 
AudioQuest CleanScreen TV Screen Cleaning Kit 
$19.75</ProductAccessories>
    <ProductFeatures>Detailed Specifications:
Basic Specifications
10-bit LCD Panel Yes
120HzFrameRate Yes
Aspect Ratio 16:09
Audio System 10W + 10W +15W (Subwoofer)
Backlight System Edge LED
Panel Type X-Gen LCD Panel
Pixel Resolution 1920 x 1080 (x4 sub-pixels) 8 million dots
Response Time 4ms
Tuning System ATSC / QAM / NTSC
Viewing Angles 176° H / 176° V   Features
AQUOS Net Yes
AQUOS AdvantageSM Support Yes
AQUOS® Series Yes
Digital Still Picture Display Yes
Quattron quad pixel technology Yes
Included Accessories
Remote Control Yes
Table Stand Yes Power
Power Consumption AC (watts) 160W
Power Source 120 V, 60 Hz
Terminals
Audio Inputs (L/R) RCA X 2
Composite Video 1
Ethernet Input 1
HD Component 1
HDMI® 4
PC 1 (15-pin D-sub)
RS-232C 1
Weight &amp; Dimensions  Dimensions
Dimensions (wxhxd) (inches) 49-39/64" x 31-59/64" x 1-37/64
Dimensions with Stand(wxhxd) (inches) 49-39/64" x 33-57/64" x 13-25/64" Weight
Product Weight (lbs.) 66.1
Weight with Stand &amp; Speakers (lbs.) 79.4</ProductFeatures>
    <!--New Line-->
    <CreatedDate>2011-03-13T12:59:54.627</CreatedDate>
    <!--New Line-->
    <LastModifiedDate>2011-03-13T12:59:54.627</LastModifiedDate>
    <!--New Line-->
  </product>
</products>

Thanks,

取值

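3 Answers:

Answer 0 (score: 6)

If I understand the question correctly and your 14 MB input XML files already contain the AddNewLine delimiter, then you probably do not need to load the whole file and split it into parts at all. Just read the input file line by line, replace the AddNewLine text with a new line in every line where the delimiter occurs, and write the modified line to a new output file.

The following code replaces the AddNewLine text with \n\r\n in a few statements and is much faster than your function: less than 1 second.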

using (var streamOut = new StreamWriter(outputFileName))
{
  using (var streamIn = new StreamReader(inputFileName))
  {
     while (!streamIn.EndOfStream)
     {
        string line = streamIn.ReadLine();
        line = line.Replace("AddNewLine", "\n\r\n");
        streamOut.WriteLine(line);
     }
  }
}
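
If the XML is already in memory as a string, as it is in the question's WriteFile(string xml, int fileNum), the same replacement can probably be done without an intermediate file. The sketch below is only an illustration under that assumption: the class name XmlFileWriter, the helper InsertLineBreaks, and the use of Environment.NewLine instead of the question's "\n\r\n" are choices made here, not part of the original answer. It also avoids the `newXml = newXml + ...` loop from the question, which copies the growing string on every iteration and is a likely cause of the slowdown.

```csharp
using System;
using System.IO;

static class XmlFileWriter
{
    // Replaces every occurrence of the marker with a line break
    // in a single pass over the string.
    public static string InsertLineBreaks(string xml, string marker)
    {
        return xml.Replace(marker, Environment.NewLine);
    }

    // Writes the rewritten XML to the given path (hypothetical signature;
    // the question builds the path from a Connection object instead).
    public static void WriteFile(string xml, string path)
    {
        File.WriteAllText(path, InsertLineBreaks(xml, "AddNewLine"));
    }
}
```

For example, `XmlFileWriter.InsertLineBreaks("<a>1</a>AddNewLine<b>2</b>", "AddNewLine")` returns the two elements separated by one platform line break.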

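Answer 1 (score: 2)

I think you should look into vtd-xml, for at least three reasons:

  1. Parsing performance and memory usage.
  2. Incremental update: the problem with DOM is that it builds the tree by splitting the input document apart and then writes the whole thing back out by concatenation. VTD-XML does not split the input document; a modification inserts the whitespace characters (in your case) directly into the byte representation of the document. SAX and pull parsers have a similar problem.
  3. Support for XPath and random access.

Based on the information given above, I fully expect performance of under 1 second per file. What do your files look like? I would be happy to provide some sample code.

Here is the code that performs the whitespace insertion: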

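    using System;
    using System.Text;
    using System.Net;
    using com.ximpleware;

    public static void insertWS()
    {
        VTDGen vg = new VTDGen();
        // second argument: namespace awareness (off here)
        if (vg.parseFile("input.xml", false))
        {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            XMLModifier xm = new XMLModifier(vn);
            // select every child element of each <product>
            ap.selectXPath("/products/product/*");
            while (ap.evalXPath() != -1)
            {
                // queue a line break after the current element
                xm.insertAfterElement("\n");
            }
            // write the modified document to a new file
            xm.output("output.xml");
        }
    }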
    

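Answer 2 (score: 2)

If I were you, I would drop the string-replacement approach and come at this from a different angle. I would add the new lines as part of the XML while creating it, rather than after the fact. Something like: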

// requires: using System.Data; using System.IO; using System.Xml;
void WriteXml(string xmlFileName, DataRowCollection rows)
{
    var xmlSettings = new XmlWriterSettings { Indent = true };

    using(StreamWriter stream = new StreamWriter(xmlFileName))
    using(XmlWriter writer = XmlWriter.Create(stream, xmlSettings))
    {
        writer.WriteStartElement("products");

        foreach(DataRow row in rows)
        {
            writer.WriteStartElement("product");

            writer.WriteElementString("ProductID", row["ProductID"].ToString());

            writer.Flush();
            stream.WriteLine(); //insert new line

            writer.WriteElementString("Product", row["Product"].ToString());

            writer.Flush();
            stream.WriteLine(); //insert new line

            //repeat for rest of columns/elements
            //...

            writer.WriteEndElement(); //end product
        }

        writer.WriteEndElement(); //end products
    }
}
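
The "repeat for rest of columns" step above could be handled with a loop over the columns instead of one block per column. The following is a minimal sketch under a few assumptions that are not in the original answer: the data arrives as a DataTable, every column name is a valid XML element name, and the class name ProductXmlExport is made up for illustration. It keeps the answer's technique of flushing the XmlWriter and then writing a line break to the underlying writer.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

static class ProductXmlExport
{
    // Writes each row as a <product> element and each column as a child
    // element, with an extra line break after every child element.
    public static void WriteXml(TextWriter output, DataTable table)
    {
        var settings = new XmlWriterSettings { Indent = true };

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("products");

            foreach (DataRow row in table.Rows)
            {
                writer.WriteStartElement("product");

                foreach (DataColumn column in table.Columns)
                {
                    // Assumes the column name is a valid XML element name.
                    writer.WriteElementString(column.ColumnName, Convert.ToString(row[column]));

                    // Push the element to the underlying writer, then add the
                    // line break there so the XmlWriter's indentation is unaffected.
                    writer.Flush();
                    output.WriteLine();
                }

                writer.WriteEndElement(); // end product
            }

            writer.WriteEndElement(); // end products
        }
    }
}
```

Passing a StreamWriter for a file, or a StringWriter for testing, should both work, since the method only needs a TextWriter. XmlWriter also escapes characters such as & in the column values, so the ntext content does not need to be pre-escaped.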