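Parsing XML as fast as possible

Time: 2017-06-07 20:10:33

Tags: xml vb.net

I have a VB.Net application that reads a zip file containing an XML file. I need to parse the XML file into row segments, pull out one node value as the application ID, and send it to an MS SQL database. The XML file looks like this: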

<?xml version="1.0" encoding="UTF-8"?>
<PROJECTS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<APPLICATION_ID>9243987</APPLICATION_ID>
<ACTIVITY>P30</ACTIVITY>
<ADMINISTERING_IC>AR</ADMINISTERING_IC>
<APPLICATION_TYPE>5</APPLICATION_TYPE>
<ARRA_FUNDED>N</ARRA_FUNDED>
<AWARD_NOTICE_DATE>05/22/2017</AWARD_NOTICE_DATE>
<BUDGET_START>04/01/2017</BUDGET_START>
</row>
<row>
<APPLICATION_ID>9243988</APPLICATION_ID>
<ACTIVITY>P30</ACTIVITY>
<ADMINISTERING_IC>AR</ADMINISTERING_IC>
<APPLICATION_TYPE>5</APPLICATION_TYPE>
<ARRA_FUNDED>N</ARRA_FUNDED>
<AWARD_NOTICE_DATE>05/22/2017</AWARD_NOTICE_DATE>
<BUDGET_START>04/01/2017</BUDGET_START>
</row>
<row>
<APPLICATION_ID>9243989</APPLICATION_ID>
<ACTIVITY>P30</ACTIVITY>
<ADMINISTERING_IC>AR</ADMINISTERING_IC>
<APPLICATION_TYPE>5</APPLICATION_TYPE>
<ARRA_FUNDED>N</ARRA_FUNDED>
<AWARD_NOTICE_DATE>05/22/2017</AWARD_NOTICE_DATE>
<BUDGET_START>04/01/2017</BUDGET_START>
</row>
</PROJECTS>

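The file can contain a million records and be close to 100 MB in size. My current code is below; it can take 8 hours to run through a million records.

My VB code to parse the file is: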

            If ofdXML.ShowDialog <> Windows.Forms.DialogResult.Cancel Then
            stopWatch.Start()
            Dim result As String
            Dim fName As String = ofdXML.FileName
            If fName.EndsWith("zip") Then
                Dim ePath As String = "E:\Downloads\WEEKLY"
                fileName = ExtractArchive(fName, ePath)
                fName = Path.Combine(ePath, fileName)
            End If

            result = Path.GetFileNameWithoutExtension(fName)
            Dim rdr As New StreamReader(fName)
            While (rdr.Peek >= 0)
                varLine = rdr.ReadLine
                sTag = varLine.Contains("<row>")
                eTag = varLine.Contains("</row>")
                If sTag And eTag Then
                    appLine = varLine
                    If appLine.Contains("<row><APPLICATION_ID>") Then
                        appID = appLine.Substring(Len("<row><APPLICATION_ID>"), appLine.IndexOf("/APPLICATION_ID") - Len("<row><APPLICATION_ID>") - 1)
                    End If
                ElseIf sTag Then
                    v1 = True
                    appLine = varLine
                    If appLine.Contains("<row><APPLICATION_ID>") Then
                        appID = appLine.Substring(Len("<row><APPLICATION_ID>"), appLine.IndexOf("/APPLICATION_ID") - Len("<row><APPLICATION_ID>") - 1)
                    End If
                ElseIf eTag Then
                    appLine = appLine & varLine
                    v1 = False
                ElseIf v1 Then
                    appLine = appLine & varLine
                    If appLine.Contains("<APPLICATION_ID>") Then
                        Dim xi As Integer = appLine.IndexOf("_ID>") + 4
                        appID = appLine.Substring(xi, appLine.IndexOf("/APPLICATION_ID") - (xi + 1))
                    End If
                End If


                If Trim(Len(varLine)) > 0 And appLine.Contains("<row>") And appLine.Contains("</row") And Not varLine.Contains("</PROJECTS>") Then
                    TextBox2.Text = i.ToString
                    TextBox3.Text = appID
                    sb.Append(appID + ",")
                    Application.DoEvents()
                    i += 1
                    ADMIN_Save_To_Database(appLine, appID, result)
                End If
            End While

        End If

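Any help is greatly appreciated.

2 Answers:

Answer 0 (score: 0)

I'd suggest you look into actual XML parsing: either a DOM that you can query, or SAX that you can "listen" to. You are only interested in one specific tag, so setting up a SAX listener for that tag and ignoring everything else should be very easy.

This should get you started: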

https://www.tutorialspoint.com/vb.net/vb.net_xml_processing.htm
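
As a rough illustration of the streaming approach: the .NET base library has no SAX parser, and its closest counterpart is the forward-only `XmlReader`, which you pull nodes from instead of registering listeners. The sketch below assumes the file layout shown in the question (no default namespace, `APPLICATION_ID` as a direct child of `row`). `ForEachRow` and its `onRow` callback are hypothetical names standing in for whatever saves a row to the database.

```vbnet
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq

Module StreamingRowReader
    ' Hypothetical helper: calls onRow(rowXml, applicationId) once per <row>,
    ' keeping only the current row in memory.
    Public Sub ForEachRow(input As TextReader, onRow As Action(Of String, String))
        Dim settings As New XmlReaderSettings With {.IgnoreWhitespace = True}
        Using reader As XmlReader = XmlReader.Create(input, settings)
            reader.MoveToContent()
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "row" Then
                    ' XNode.ReadFrom consumes the whole element and leaves the
                    ' reader on the node after </row>, so no extra Read() here.
                    Dim row = DirectCast(XNode.ReadFrom(reader), XElement)
                    Dim idElement = row.Element("APPLICATION_ID")
                    If idElement IsNot Nothing Then
                        onRow(row.ToString(SaveOptions.DisableFormatting), idElement.Value)
                    End If
                Else
                    reader.Read()
                End If
            End While
        End Using
    End Sub

    Sub Main()
        ' Tiny in-memory sample; a real run would pass New StreamReader(fName).
        Dim sample = "<PROJECTS><row><APPLICATION_ID>9243987</APPLICATION_ID><ACTIVITY>P30</ACTIVITY></row>" &
                     "<row><APPLICATION_ID>9243988</APPLICATION_ID></row></PROJECTS>"
        Using input As New StringReader(sample)
            ForEachRow(input, Sub(rowXml, appId) Console.WriteLine(appId))
        End Using
    End Sub
End Module
```

The loop avoids calling `Read()` after a row has been consumed; doing so would skip every second `row` when the elements sit directly next to each other.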

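If you insist on string parsing, look for optimizations. Loops are killers! If you can avoid it, you don't want to do expensive things inside a loop.

For example, you compute the length of "<row><APPLICATION_ID>" twice per line (depending on the formatting). Not only is that expensive, the result is also constant! Set or compute it once, outside the loop.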
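
A minimal sketch of hoisting that constant, assuming (as the `Substring` call in the question does) that the line begins with the `<row><APPLICATION_ID>` prefix. `ExtractId` and the constant names are hypothetical. Note that a .NET string stores its length, so the gain from this particular change is probably small; the main benefit is that the extraction logic lives in one place instead of being repeated in each branch.

```vbnet
Imports System

Module HoistConstants
    ' Defined once, outside any loop; these values never change.
    Private Const RowIdPrefix As String = "<row><APPLICATION_ID>"
    Private Const IdCloseTag As String = "</APPLICATION_ID>"

    ' Returns the ID from a line starting with "<row><APPLICATION_ID>",
    ' or Nothing when the line does not have that shape.
    Function ExtractId(line As String) As String
        If Not line.StartsWith(RowIdPrefix, StringComparison.Ordinal) Then Return Nothing
        Dim closeAt = line.IndexOf(IdCloseTag, RowIdPrefix.Length, StringComparison.Ordinal)
        If closeAt < 0 Then Return Nothing
        Return line.Substring(RowIdPrefix.Length, closeAt - RowIdPrefix.Length)
    End Function

    Sub Main()
        Console.WriteLine(ExtractId("<row><APPLICATION_ID>9243987</APPLICATION_ID>")) ' prints 9243987
    End Sub
End Module
```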

All of those .Contains() calls are quite expensive, and many of them are redundant. For example, you check for "<row>" and "</row>" near the top of the loop, and then do it again near the bottom.
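
One way to remove the repeated scans, sketched under the assumption that a `<row>` never starts on the same line where the previous one ends: test each line for the two tags once, keep the results in Booleans, and track whether a row is open, so the accumulated text is never searched again. `ForEachRowBlock` and `onRow` are hypothetical names.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module SingleScanPerLine
    ' Hypothetical helper: hands each complete <row>...</row> block to onRow,
    ' scanning every input line for the tags exactly once.
    Sub ForEachRowBlock(input As TextReader, onRow As Action(Of String))
        Dim current As New StringBuilder()
        Dim inRow As Boolean = False
        Dim line As String = input.ReadLine()
        While line IsNot Nothing
            Dim hasStart = line.Contains("<row>")
            Dim hasEnd = line.Contains("</row>")
            If hasStart Then
                current.Clear()
                inRow = True
            End If
            If inRow Then current.Append(line)
            ' AndAlso short-circuits; And would evaluate both operands every time.
            If inRow AndAlso hasEnd Then
                onRow(current.ToString())
                inRow = False
            End If
            line = input.ReadLine()
        End While
    End Sub

    Sub Main()
        Dim sample = String.Join(Environment.NewLine,
            {"<PROJECTS>", "<row>", "<APPLICATION_ID>9243987</APPLICATION_ID>", "</row>", "</PROJECTS>"})
        Using input As New StringReader(sample)
            ForEachRowBlock(input, Sub(rowXml) Console.WriteLine(rowXml))
        End Using
    End Sub
End Module
```

The `StringBuilder` also avoids re-copying the growing row text on every `&` concatenation.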

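In short, your best bet is an XML parsing tool. If you don't want to go that route, look carefully through your code for expensive operations that you can either pull out of the loop entirely or perform only once per iteration.

Answer 1 (score: 0)

I have changed my code to: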

            Dim rdr As New StreamReader(fName)
            Dim xml As New XmlDocument()
            xml.Load(rdr)
            Dim DocumentNodes As XmlNodeList = xml.GetElementsByTagName("row")
            For Each xn As XmlNode In DocumentNodes
                Dim example As XmlNode = xn.SelectSingleNode("APPLICATION_ID")
                If example IsNot Nothing Then
                    Dim applicationID As String = example.InnerText
                    ADMIN_Save_AuthoringNames(xn.InnerXml, applicationID, result)
                End If
            Next

I'll let you know how it runs.