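I'm trying to scrape my employer's website to extract the images from their blog posts en masse. I've started building a scraping tool in Excel using VBA.
(We don't have access to the SQL database.)
I've set up a worksheet with a list of post identifiers in column A and the posts' URLs in column B.
So far, my VBA script loops through the list of URLs in column B, pulls the HTML out of a tag on the page using getElementById, and pastes the resulting output as a string into column C.
I'm now at the point of trying to work out how to extract the src attribute from each image in that output and paste it into the relevant column. I can't for the life of me come up with a simple solution. I'm not very familiar with RegEx, and I'm struggling with Excel's built-in string functions.
The end game is to have the macro run through each image URL and save the image to disk with a filename in the format "{event number} - {image number}.jpg".
Any help is much appreciated.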
Sub Get_Image_SRC()
    Dim sht As Worksheet
    Dim LastRow As Long
    Dim i As Integer
    Dim url As String
    Dim IE As Object
    Dim objElement As Object
    Dim objCollection As Object
    Dim Elements As IHTMLElementCollection
    Dim Element As IHTMLElement

    Set sht = ThisWorkbook.Worksheets("Sheet1")
    'Ctrl + Shift + End
    LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row

    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True

    For i = 2 To LastRow
        url = Cells(i, "C").Value
        MsgBox (url)
        IE.navigate url
        Application.StatusBar = url & " is loading..."
        Do While IE.readyState = 4: DoEvents: Loop
        Do Until IE.readyState = 4: DoEvents: Loop
        Application.StatusBar = url & " Loaded"
        If Cells(i, "B").Value = "WEBNEWS" Then
            Cells(i, "D").Value = IE.document.getElementById("NewsDetail").outerHTML
        Else
            Cells(i, "D").Value = IE.document.getElementById("ReviewContainer").outerHTML
        End If
    Next i

    Set IE = Nothing
    Set objElement = Nothing
    Set objCollection = Nothing
End Sub
Sample of the resulting HTML:
<div id=""NewsDetail""><div class=""NewsDetailTitle"">Video: Race Face Behind the Scenes Tour</div><div class=""NewsDetailImage""><img alt=""HeadlinesThumbnail.jpg"" src=""/ImageHandler/6190/515/1000/0/""></div> <div class=""NewsDetailBody"">Pinkbike posted this video a while ago, if you missed it, its' definitely worth a watch.
Ken from Camp of Champions took a look at their New Westminster factory last year which gives a look at the production, people and culture of Race Face. The staff at Race Face are truly their greatest asset they had, best wishes to everyone!
<p><center><object width=""500"" height=""281""><param name=""allowFullScreen"" value=""true""><param name=""AllowScriptAccess"" value=""always""><param name=""movie"" value=""http://www.pinkbike.com/v/188244""><embed width=""500"" height=""281"" src=""http://www.pinkbike.com/v/188244"" type=""application/x-shockwave-flash"" allowscriptaccess=""always"" allowfullscreen=""true""></object></center><p></p>
</div><div class=""NewsDate"">Published Friday, 25 November 2011</div></div>"
Answer 0 (score: 1)
I'm not sure why you'd use VBA when you can do this easily with Wget: How do I use Wget to download all Images into a single Folder.
Answer 1 (score: 0)
For a regex approach, you should take a look at these two links:
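It basically boils down to:
- The regex to get the value of the src attribute of an img is src\s*=\s*"(.+?)"
- Use the VBScript.RegExp library to work with regular expressions in VBA.
I used late binding, but you can add a reference instead if you prefer.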
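For comparison, the early-bound form would look roughly like the sketch below. It assumes you have ticked "Microsoft VBScript Regular Expressions 5.5" under Tools > References in the VBA editor; without that reference it will not compile. The Sub name and the one-line HTML string are placeholders for illustration only.

```vba
Sub EarlyBoundRegexDemo()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim objRegex As VBScript_RegExp_55.RegExp
    Dim objMatches As VBScript_RegExp_55.MatchCollection

    Set objRegex = New VBScript_RegExp_55.RegExp
    objRegex.IgnoreCase = True
    objRegex.Global = True
    objRegex.Pattern = "src\s*=\s*""(.+?)"""

    ' placeholder input with a single img tag
    Set objMatches = objRegex.Execute("<img src=""a.png"">")
    Debug.Print objMatches.Count
End Sub
```

Early binding gives you IntelliSense and compile-time checking; late binding avoids having to set the reference on every machine the workbook runs on.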
With late binding, the VBA looks like this:
Option Explicit

Sub Test()
    Dim strHtml As String

    ' sample html, note the three img tags
    strHtml = ""
    strHtml = strHtml & "<div id=""foo"">"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""fred"" src=""\\server\path\picture1.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""ned"" src=""\\server\path\picture2.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "<bar class=""baz"">"
    strHtml = strHtml & "<img alt=""teddy"" src=""\\server\path\picture3.png"" />"
    strHtml = strHtml & "</bar>"
    strHtml = strHtml & "</div>"

    Dim strSrc As String
    Dim objRegex As Object
    Dim objMatches As Object
    Dim lngMatchCount As Long, lngCounter As Long

    ' create regex
    Set objRegex = CreateObject("VBScript.RegExp")

    ' set pattern and execute
    With objRegex
        .IgnoreCase = True
        .Pattern = "src\s*=\s*""(.+?)"""
        .Global = True
        If .Test(strHtml) Then
            Set objMatches = .Execute(strHtml)
            lngMatchCount = objMatches.Count
            For lngCounter = 0 To lngMatchCount - 1
                strSrc = objMatches(lngCounter).SubMatches(0)
                ' you've successfully captured the img src value
                Debug.Print strSrc
            Next
        Else
            strSrc = "Not found"
        End If
    End With
End Sub
Note that I'm taking the first item of the SubMatches collection to get the value of the src attribute. In this code, the difference between objMatches(0) and objMatches(0).SubMatches(0) is:
src="\\server\path\picture.png"
versus:
\\server\path\picture.png
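One thing to watch with this pattern: it matches src on any tag, not just img, so on the question's sample HTML it would also pick up the URL in the embed tag. Below is a sketch of a tighter pattern that only looks inside img tags. The second pattern and the Sub name are my own suggestion rather than part of the code above, and the HTML string is a cut-down version of the question's sample.

```vba
Option Explicit

Sub CompareSrcPatterns()
    Dim strHtml As String
    Dim objRegex As Object

    ' cut-down version of the question's sample: one img tag, one embed tag
    strHtml = "<div><img alt=""x"" src=""/ImageHandler/6190/515/1000/0/"">" & _
              "<embed src=""http://www.pinkbike.com/v/188244""></div>"

    Set objRegex = CreateObject("VBScript.RegExp")
    objRegex.IgnoreCase = True
    objRegex.Global = True

    ' original pattern: src on any tag, so the img and the embed both match
    objRegex.Pattern = "src\s*=\s*""(.+?)"""
    Debug.Print "any tag: " & objRegex.Execute(strHtml).Count

    ' tighter pattern (assumption): src only when it sits inside an <img ...> tag
    objRegex.Pattern = "<img\b[^>]*?\bsrc\s*=\s*""([^""]*)"""
    Debug.Print "img only: " & objRegex.Execute(strHtml).Count
End Sub
```

With either pattern, SubMatches(0) is still the bare attribute value, so the rest of the code does not need to change.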
You'll probably want to wrap this up as a function and call it at the point where you work out the value of IE.document.getElementById("NewsDetail").outerHTML in the If..End If block of your code.
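A minimal sketch of that wrapping, extended to the question's end goal of saving each image to disk. Everything named here is hypothetical: GetImgSrcs, SaveImage, Save_Post_Images, BASE_URL and SAVE_DIR are illustrative names, the base URL is a placeholder for the real site, and the folder must already exist. It follows the column layout in the question's code (identifier in A, HTML in D) rather than the one in its description, and it assumes the identifiers contain no characters that are invalid in file names. The download uses MSXML2.XMLHTTP and ADODB.Stream through late binding; treat it as a starting point, not a tested solution.

```vba
Option Explicit

Private Const BASE_URL As String = "http://www.example.com" ' placeholder for the real site
Private Const SAVE_DIR As String = "C:\Temp\BlogImages\"    ' hypothetical, must already exist

' Returns every src value found in strHtml as a Collection of strings.
' Uses the pattern from the example above, so it matches src on any tag.
Function GetImgSrcs(ByVal strHtml As String) As Collection
    Dim colResult As New Collection
    Dim objRegex As Object, objMatch As Object

    Set objRegex = CreateObject("VBScript.RegExp")
    With objRegex
        .IgnoreCase = True
        .Global = True
        .Pattern = "src\s*=\s*""(.+?)"""
        For Each objMatch In .Execute(strHtml)
            colResult.Add objMatch.SubMatches(0)
        Next objMatch
    End With

    Set GetImgSrcs = colResult
End Function

' Downloads strUrl and writes the response bytes to strPath.
Sub SaveImage(ByVal strUrl As String, ByVal strPath As String)
    Dim objHttp As Object, objStream As Object

    Set objHttp = CreateObject("MSXML2.XMLHTTP")
    objHttp.Open "GET", strUrl, False
    objHttp.send

    If objHttp.Status = 200 Then
        Set objStream = CreateObject("ADODB.Stream")
        objStream.Type = 1                  ' adTypeBinary
        objStream.Open
        objStream.Write objHttp.responseBody
        objStream.SaveToFile strPath, 2     ' adSaveCreateOverWrite
        objStream.Close
    End If
End Sub

' Reads the HTML stored in column D and saves each image as
' "{identifier} - {image number}.jpg".
Sub Save_Post_Images()
    Dim sht As Worksheet
    Dim lngRow As Long, lngLastRow As Long, lngImg As Long
    Dim varSrc As Variant, strUrl As String

    Set sht = ThisWorkbook.Worksheets("Sheet1")
    lngLastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row

    For lngRow = 2 To lngLastRow
        lngImg = 0
        For Each varSrc In GetImgSrcs(CStr(sht.Cells(lngRow, "D").Value))
            lngImg = lngImg + 1
            strUrl = CStr(varSrc)
            ' root-relative paths such as /ImageHandler/... need the site prefix
            If Left$(strUrl, 1) = "/" And Left$(strUrl, 2) <> "//" Then
                strUrl = BASE_URL & strUrl
            End If
            SaveImage strUrl, SAVE_DIR & sht.Cells(lngRow, "A").Value & " - " & lngImg & ".jpg"
        Next varSrc
    Next lngRow
End Sub
```

Inside the question's own loop, the same call would take IE.document.getElementById("NewsDetail").outerHTML directly instead of the cell value, which also sidesteps Excel's per-cell character limit for long posts.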