Excel VBA: Extract the image src attribute from HTML as a string

Asked: 2016-09-28 09:56:16

Tags: excel-vba dom web web-scraping dom-traversal

I am trying to scrape my employer's website to extract the images from their blog posts en masse. I have started building a scraper in Excel using VBA.

(We don't have access to the SQL database.)

I have set up a worksheet containing a list of post identifiers in column A and the URLs of the posts in column B.

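So far, my VBA script loops through the list of URLs in column B, uses getElementById to extract the HTML from a tag on the page, and pastes the resulting output as a string into column C.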

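I am now trying to work out how to extract the src attribute of every image in the resulting output and paste each one into the relevant column. I can't for the life of me come up with a simple solution. I am not very familiar with RegEx, and I am struggling with Excel's built-in string functions.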

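The end goal is for the macro to run through each image URL and save the image to disk, with a file name in the format "{event number} - {image number}.jpg".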

Any help is much appreciated.

Worksheet setup

Sub Get_Image_SRC()

Dim sht As Worksheet
Dim LastRow As Long
Dim i As Long   ' Long rather than Integer: row numbers can exceed 32,767
Dim url As String
Dim IE As Object

Set sht = ThisWorkbook.Worksheets("Sheet1")
' Last used row in column A (like pressing Ctrl + Up from the bottom of the sheet)
LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
For i = 2 To LastRow
    ' Qualify Cells with sht so the macro does not depend on the active sheet
    url = sht.Cells(i, "C").Value
    MsgBox url
    IE.navigate url
    Application.StatusBar = url & " is loading..."
    ' Wait until the browser reports that the page has finished loading
    Do While IE.Busy Or IE.readyState <> 4: DoEvents: Loop
    Application.StatusBar = url & " Loaded"
    If sht.Cells(i, "B").Value = "WEBNEWS" Then
        sht.Cells(i, "D").Value = IE.document.getElementById("NewsDetail").outerHTML
    Else
        sht.Cells(i, "D").Value = IE.document.getElementById("ReviewContainer").outerHTML
    End If
Next i

IE.Quit   ' close the browser; Set IE = Nothing alone leaves it running
Set IE = Nothing
Application.StatusBar = False   ' hand the status bar back to Excel

End Sub

Example of the resulting HTML:

<div id=""NewsDetail""><div class=""NewsDetailTitle"">Video: Race Face Behind the Scenes Tour</div><div class=""NewsDetailImage""><img alt=""HeadlinesThumbnail.jpg"" src=""/ImageHandler/6190/515/1000/0/""></div>    <div class=""NewsDetailBody"">Pinkbike posted this video a while ago, if you missed it, its' definitely worth a watch. 

Ken from Camp of Champions took a look at their New Westminster factory last year which gives a look at the production, people and culture of Race Face. The staff at Race Face are truly their greatest asset they had, best wishes to everyone!

<p><center><object width=""500"" height=""281""><param name=""allowFullScreen"" value=""true""><param name=""AllowScriptAccess"" value=""always""><param name=""movie"" value=""http://www.pinkbike.com/v/188244""><embed width=""500"" height=""281"" src=""http://www.pinkbike.com/v/188244"" type=""application/x-shockwave-flash"" allowscriptaccess=""always"" allowfullscreen=""true""></object></center><p></p>


</div><div class=""NewsDate"">Published Friday, 25 November 2011</div></div>"

My current references

2 Answers:

Answer 0 (score: 1)

You could do this easily with Wget instead of VBA: How do I use Wget to download all Images into a single Folder

Answer 1 (score: 0)

For a regex approach, you should look at the following two links:

It basically boils down to:

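  • A regular expression that captures the value of the src attribute of an img is src\s*=\s*"(.+?)"
  • Use the VBScript.RegExp library to work with regular expressions in VBA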

I have used late binding, but you can add a reference instead if you prefer.
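For comparison, here is a minimal sketch of what the early-bound alternative might look like. It assumes you have ticked "Microsoft VBScript Regular Expressions 5.5" under Tools > References; the Sub name and the sample string are made up for illustration.

```vba
Option Explicit

' Early-bound variant: needs a reference to
' "Microsoft VBScript Regular Expressions 5.5" (Tools > References).
Sub TestEarlyBound()   ' hypothetical name, for illustration only

    Dim objRegex As RegExp
    Dim objMatches As MatchCollection

    Set objRegex = New RegExp
    With objRegex
        .IgnoreCase = True
        .Global = True
        .Pattern = "src\s*=\s*""(.+?)"""
    End With

    ' Made-up sample input containing a single img tag
    Set objMatches = objRegex.Execute("<img alt=""fred"" src=""picture1.png"" />")
    If objMatches.Count > 0 Then
        Debug.Print objMatches(0).SubMatches(0)
    End If

End Sub
```

Early binding gives you IntelliSense and compile-time checking; late binding avoids having to set the reference on every machine that runs the workbook.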

The VBA (late-bound) then looks like this:

Option Explicit

Sub Test()

Dim strHtml As String

' sample html, note that it contains three img tags
strHtml = ""
strHtml = strHtml & "<div id=""foo"">"
strHtml = strHtml & "<bar class=""baz"">"
strHtml = strHtml & "<img alt=""fred"" src=""\\server\path\picture1.png"" />"
strHtml = strHtml & "</bar>"
strHtml = strHtml & "<bar class=""baz"">"
strHtml = strHtml & "<img alt=""ned"" src=""\\server\path\picture2.png"" />"
strHtml = strHtml & "</bar>"
strHtml = strHtml & "<bar class=""baz"">"
strHtml = strHtml & "<img alt=""teddy"" src=""\\server\path\picture3.png"" />"
strHtml = strHtml & "</bar>"
strHtml = strHtml & "</div>"

Dim strSrc As String
Dim objRegex As Object
Dim objMatches As Object
Dim lngMatchCount As Long, lngCounter As Long

' create regex
Set objRegex = CreateObject("VBScript.RegExp")

' set pattern and execute
With objRegex
    .IgnoreCase = True
    .Pattern = "src\s*=\s*""(.+?)"""
    .Global = True

    If .Test(strHtml) Then
        Set objMatches = .Execute(strHtml)
        lngMatchCount = objMatches.Count
        For lngCounter = 0 To lngMatchCount - 1
            strSrc = objMatches(lngCounter).SubMatches(0)
            ' you've successfully captured the img src value
            Debug.Print strSrc
        Next
    Else
        strSrc = "Not found"
    End If
End With

End Sub

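Note that I take the first item of the SubMatches collection to get the value of the src attribute. In this code, the difference between objMatches(0) and objMatches(0).SubMatches(0) is:

src="\\server\path\picture1.png"

versus:

\\server\path\picture1.png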

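You will probably want to wrap this up as a function and call it at the point where you work out the value of IE.document.getElementById("NewsDetail").outerHTML in the If..End If block of your code.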
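A rough sketch of such a wrapper is below. The names GetImageSrcs and WriteImageSrcs are made up for illustration, and writing the results to columns E onwards is an assumption about the worksheet layout. Note one deliberate change from the pattern above: this version anchors on <img, because the sample HTML in the question also contains an <embed> tag with a src attribute, which the plain src pattern would pick up as well.

```vba
Option Explicit

' Hypothetical helper: returns every img src value found in strHtml
' as a Collection of strings (empty if there are no matches).
Public Function GetImageSrcs(ByVal strHtml As String) As Collection

    Dim objRegex As Object
    Dim objMatch As Object
    Dim colSrcs As Collection

    Set colSrcs = New Collection
    Set objRegex = CreateObject("VBScript.RegExp")   ' late binding

    With objRegex
        .IgnoreCase = True
        .Global = True
        ' Assumption: only src attributes inside <img ...> tags are wanted
        .Pattern = "<img\b[^>]*?\bsrc\s*=\s*""(.+?)"""
        For Each objMatch In .Execute(strHtml)
            colSrcs.Add objMatch.SubMatches(0)
        Next objMatch
    End With

    Set GetImageSrcs = colSrcs

End Function

' Hypothetical usage: write the src values for row i into columns E, F, G, ...
' (assumes the HTML for that row is already in column D)
Public Sub WriteImageSrcs(ByVal sht As Worksheet, ByVal i As Long)

    Dim colSrcs As Collection
    Dim n As Long

    Set colSrcs = GetImageSrcs(sht.Cells(i, "D").Value)
    For n = 1 To colSrcs.Count
        sht.Cells(i, 4 + n).Value = colSrcs(n)
    Next n

End Sub
```

Inside the question's loop you could call WriteImageSrcs sht, i just before Next i. Passing outerHTML straight to GetImageSrcs rather than reading it back from the cell may be safer, since a cell holds at most 32,767 characters and long posts could be truncated.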