Extracting hyperlinks from a website using VBA: running into an error

Time: 2018-09-29 15:31:50

Tags: html excel vba web-scraping hyperlink

I am trying to extract, from the web pages I enter, all hyperlinks that contain "http://www.bursamalaysia.com/market/listed-companies/company-announcements/".

At first the code ran fine, but after that I ran into a problem: the URL links I need are no longer extracted. They are missing every time I run the sub.

Link: http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All

Sub scrapeHyperlinks()

    Dim IE As InternetExplorer
    Dim html As HTMLDocument
    Dim ElementCol As Object
    Dim Link As Object
    Dim erow As Long
    Application.ScreenUpdating = False
    Set IE = New InternetExplorer


    For u = 1 To 50
        IE.Visible = False
        IE.navigate Cells(u, 2).Value
        Do While IE.readyState <> READYSTATE_COMPLETE
            Application.StatusBar = "Trying to go to websitehahaha"
            DoEvents
        Loop
        Set html = IE.document
        Set ElementCol = html.getElementsByTagName("a")
        For Each Link In ElementCol
            erow = Worksheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Offset(1, 0).Row
            Cells(erow, 1).Value = Link
            Cells(erow, 1).Columns.AutoFit
        Next
    Next u

    ActiveSheet.Range("$A$1:$A$152184").AutoFilter Field:=1, Criteria1:="http://www.bursamalaysia.com/market/listed-companies/company-announcements/???????", Operator:=xlAnd

    For k = 1 To [A65536].End(xlUp).Row
        If Rows(k).Hidden = True Then
            Rows(k).EntireRow.Delete
            k = k - 1
        End If
    Next k


    Set IE = Nothing
    Application.StatusBar = ""
    Application.ScreenUpdating = True
End Sub

1 answer:

Answer 0: (score: 1)

To get only the qualifying hrefs you mention from the given URL, I would use the following. It uses a CSS selector combination to target the URLs of interest on the specified page.

The CSS selector combination is

#bm_ajax_container [href^='/market/listed-companies/company-announcements/']

This is a descendant selector. It looks for elements that have an href attribute whose value starts with /market/listed-companies/company-announcements/ and that sit inside a parent element with the ID bm_ajax_container. That parent element is the ajax container div. "#" is an ID selector and "[]" is an attribute selector. "^" means "starts with".

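Example of the container div and the first matching href:

Because more than one element is to be matched, the CSS selector combination is applied via the querySelectorAll method. This returns a nodeList whose .Length can be used to loop over it and access individual items by index.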
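A minimal sketch of that traversal, not taken from the answer itself. It assumes the project references the Microsoft HTML Object Library (as the question's code already does) and that `doc` is an already loaded page document such as `IE.document`. The function name `CollectAnnouncementLinks` is hypothetical. A nodeList is zero-based, so the loop runs from 0 to `.Length - 1`:

```vba
Option Explicit

' Hypothetical helper: returns the href of every anchor matched by the selector.
' doc is assumed to be a loaded MSHTML document, e.g. IE.document.
Public Function CollectAnnouncementLinks(ByVal doc As HTMLDocument) As Collection
    Const SELECTOR As String = _
        "#bm_ajax_container [href^='/market/listed-companies/company-announcements/']"
    Dim nodes As Object
    Dim results As Collection
    Dim i As Long

    Set results = New Collection
    Set nodes = doc.querySelectorAll(SELECTOR)

    ' nodeList items are indexed from 0 to Length - 1
    For i = 0 To nodes.Length - 1
        results.Add nodes.item(i).href
    Next i

    Set CollectAnnouncementLinks = results
End Function
```

Looping by index rather than with For Each keeps the access pattern explicit, and the returned Collection can then be written to a sheet or filtered further.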

The full qualifying links are written out to the worksheet.


Sample CSS query results from the page using the selector:

(image not preserved)


VBA:

(code listing not preserved)
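The sketch below is not the answerer's original code. It is one possible end-to-end version of the approach described above, under stated assumptions: the project references Microsoft Internet Controls and the Microsoft HTML Object Library (as in the question), the output goes to column A of "Sheet1", and the sub name `ScrapeAnnouncementLinks` and the `MAX_WAIT_SEC` timeout are made up for illustration. Because the target container is described as an ajax container, the sketch also polls briefly for matches after the page reports complete; that wait is an assumption, not something the answer states.

```vba
Option Explicit

' Sketch only: navigates to the page, applies the CSS selector, and writes
' the full href of each match to column A of Sheet1.
Public Sub ScrapeAnnouncementLinks()
    Const URL As String = "http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=SH&sub_category=all&alphabetical=All"
    Const SELECTOR As String = "#bm_ajax_container [href^='/market/listed-companies/company-announcements/']"
    Const MAX_WAIT_SEC As Long = 10   ' assumed timeout, adjust as needed

    Dim ie As InternetExplorer
    Dim links As Object
    Dim i As Long
    Dim t As Single

    Set ie = New InternetExplorer
    With ie
        .Visible = False
        .navigate URL

        ' Wait for the page itself to finish loading
        Do While .Busy Or .readyState <> READYSTATE_COMPLETE
            DoEvents
        Loop

        ' Assumption: the ajax container may be filled after the page is
        ' complete, so poll for matches until some appear or the timeout passes
        t = Timer
        Do
            DoEvents
            Set links = .document.querySelectorAll(SELECTOR)
        Loop While links.Length = 0 And Timer - t < MAX_WAIT_SEC

        With ThisWorkbook.Worksheets("Sheet1")
            ' nodeList is zero-based; rows start at 1
            For i = 0 To links.Length - 1
                .Cells(i + 1, 1).Value = links.item(i).href
            Next i
            .Columns(1).AutoFit
        End With

        .Quit
    End With
End Sub
```

Selecting only the matching anchors up front removes the need for the AutoFilter and row-deletion pass in the question's code. To process the list of URLs held in column B, the navigation and query steps could be wrapped in the question's `For u = 1 To 50` loop, with the output row tracked separately.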