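Can this web-scraping code be made better and more correct?

Date: 2012-01-30 15:17:05

Tags: vb.net web web-crawler web-scraping

I'm trying to scrape school information from a website and save it to an Excel sheet, with each detail in its own column. The code below got me part of the way. Column headers: school name, mascot, address, type, phone, fax, and so on. I have a list of schools; as an example, I have used a single link here.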

Imports System.IO.StreamReader
Imports System.Text.RegularExpressions

Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim request As System.Net.HttpWebRequest = System.Net.WebRequest.Create("http://www.maxpreps.com/high-schools/abbeville-yellowjackets-(abbeville,al)/home.htm")
        Dim response As System.Net.HttpWebResponse = request.GetResponse

        Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
        Dim rsssource As String = sr.ReadToEnd
        Dim r As New System.Text.RegularExpressions.Regex("<h1 id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Header"">.*</h1>")
        Dim r1 As New System.Text.RegularExpressions.Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Mascot"">.*</span>")
        Dim r3 As New System.Text.RegularExpressions.Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Colors"">.*</span>")
        Dim r4 As New System.Text.RegularExpressions.Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_GenderType"">.*</span>")
        Dim r5 As New System.Text.RegularExpressions.Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_AthleteDirectorGenericControl"">.*</span>")
        Dim r6 As New System.Text.RegularExpressions.Regex("<address>.*</address>")
        Dim r7 As New System.Text.RegularExpressions.Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Phone"">.*</span>")
        Dim r8 As New System.Text.RegularExpressions.Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Fax"">.*</span>")

        Dim matches As MatchCollection = r.Matches(rsssource)
        Dim matches1 As MatchCollection = r1.Matches(rsssource)
        Dim matches3 As MatchCollection = r3.Matches(rsssource)
        Dim matches4 As MatchCollection = r4.Matches(rsssource)
        Dim matches5 As MatchCollection = r5.Matches(rsssource)
        Dim matches6 As MatchCollection = r6.Matches(rsssource)
        Dim matches7 As MatchCollection = r7.Matches(rsssource)
        Dim matches8 As MatchCollection = r8.Matches(rsssource)


        For Each itemcode As Match In matches
            ListBox1.Items.Add(itemcode.Value.Split("_").GetValue(4))
            ListBox1.Items.Add(itemcode.Value.Split("><").GetValue(1))
        Next
        For Each itemcode As Match In matches1
            ListBox1.Items.Add(itemcode.Value.Split("_").GetValue(4))
            ListBox1.Items.Add(itemcode.Value.Split("><").GetValue(1))

        Next
    End Sub
End Class

1 answer:

Answer 0 (score: 1)

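You're looking for Code Review. Anyway, yes, you can do better. First, you have already imported the System.Text.RegularExpressions namespace, so you don't need to fully qualify Regex. Next, you can use groups in your matches to capture just the text between the tags instead of splitting the matched string.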
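To illustrate what a group buys you, here is a minimal sketch. The markup, the short `Mascot` id, and the `GroupDemo` module name are made up for the example; the real page uses the much longer ids shown in the question.

```vbnet
Imports System.Text.RegularExpressions

Module GroupDemo
    Sub Main()
        ' Made-up sample markup; the real page uses much longer ids.
        Dim html As String = "<span id=""Mascot"">Yellowjackets</span>"
        Dim m As Match = Regex.Match(html, "<span id=""Mascot"">(.*?)</span>")

        If m.Success Then
            Console.WriteLine(m.Value)           ' the whole match, tags included
            Console.WriteLine(m.Groups(1).Value) ' only the text captured by (.*?)
        End If
    End Sub
End Module
```

With a group in place there is no need to call Split on the matched string and guess which piece holds the value.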

Next, you can use WebClient instead of all the HttpWebRequest mess. Here's a start:

Imports System.Net
Imports System.Text.RegularExpressions

Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
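        ' Declare the page source outside the Using block so it stays in scope afterwards.
        Dim rsssource As String
        Using wc As New WebClient()
            rsssource = wc.DownloadString("http://www.maxpreps.com/high-schools/abbeville-yellowjackets-(abbeville,al)/home.htm")
        End Using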

        Dim r  As New Regex("<h1 id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Header"">(.*?)</h1>")
        Dim r1 As New Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Mascot"">(.*?)</span>")
        Dim r3 As New Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Colors"">(.*?)</span>")
        Dim r4 As New Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_GenderType"">(.*?)</span>")
        Dim r5 As New Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_AthleteDirectorGenericControl"">(.*?)</span>")
        Dim r6 As New Regex("<address>(.*?)</address>")
        Dim r7 As New Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Phone"">(.*?)</span>")
        Dim r8 As New Regex("<span id=""ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_Fax"">(.*?)</span>")

        Dim matches As MatchCollection  = r.Matches(rsssource)
        Dim matches1 As MatchCollection = r1.Matches(rsssource)
        Dim matches3 As MatchCollection = r3.Matches(rsssource)
        Dim matches4 As MatchCollection = r4.Matches(rsssource)
        Dim matches5 As MatchCollection = r5.Matches(rsssource)
        Dim matches6 As MatchCollection = r6.Matches(rsssource)
        Dim matches7 As MatchCollection = r7.Matches(rsssource)
        Dim matches8 As MatchCollection = r8.Matches(rsssource)

        For Each itemcode As Match In matches
            'ListBox1.Items.Add(itemcode.Value.Split("_").GetValue(4))
            'Use columns or something instead
            ListBox1.Items.Add(itemcode.Groups(1).Value)
        Next

        For Each itemcode As Match In matches1
            ListBox1.Items.Add(itemcode.Groups(1).Value)
        Next
    End Sub
End Class

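Next, consider giving your regular expressions meaningful names, making them Static and Compiled for efficiency, and, better still, not using regular expressions at all. Oh, and use an HTML parser instead.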
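One possible reading of the naming and compiling advice is sketched below. It assumes module-level ReadOnly fields (which are created once and shared) rather than Static locals. The `SchoolPagePatterns` module, the `IdPrefix` constant, and the `ExtractFirst` helper are hypothetical names introduced for illustration, and the sample markup in `Main` is made up.

```vbnet
Imports System.Text.RegularExpressions

Module SchoolPagePatterns
    ' Common id prefix copied from the patterns above; hypothetical constant name.
    Private Const IdPrefix As String = "ctl00_NavigationWithContentOverRelated_ContentOverRelated_Header_"

    ' Built once and reused; RegexOptions.Compiled trades a slower start for faster matching.
    Public ReadOnly SchoolName As New Regex("<h1 id=""" & IdPrefix & "Header"">(.*?)</h1>", RegexOptions.Compiled)
    Public ReadOnly Mascot As New Regex("<span id=""" & IdPrefix & "Mascot"">(.*?)</span>", RegexOptions.Compiled)

    ' Returns the first capture group of the first match, or an empty string if nothing matched.
    Public Function ExtractFirst(ByVal pattern As Regex, ByVal html As String) As String
        Dim m As Match = pattern.Match(html)
        If m.Success Then Return m.Groups(1).Value
        Return String.Empty
    End Function

    Sub Main()
        ' Made-up sample markup standing in for the downloaded page.
        Dim sample As String = "<h1 id=""" & IdPrefix & "Header"">Abbeville High School</h1>"
        Console.WriteLine(ExtractFirst(SchoolName, sample))
    End Sub
End Module
```

Whether compiling pays off depends on how many pages are processed; for a single page the difference is unlikely to matter, while for a long list of schools the patterns are reused on every page.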