尝试从网站上抓取网页到excel,然后转换为word doc

时间:2018-05-02 20:47:52

标签: excel-vba web-scraping vba excel

所以我被指派从网站上删除一些信息。信息非常繁重,我需要点击个人"阅读更多"按钮以获取完整信息。点击&#34后,阅读更多"然后我需要回到上一页,点击"阅读更多"在第二项。然后冲洗并重复,直到我得到我需要的所有信息。一旦我掌握了所有信息,就需要将其转换为word文档。

  1. 导航到网址,我知道该怎么做
  2. 点击"按钮"说更多,问题是有超过70个按钮说"阅读更多"而且我不知道如何区分它们
  3. 一旦我掌握了所有信息,请抓取几条信息,如何让它变得更好?
  4. 一旦收集到excel中的信息,我该如何将其转换为单词?
  5. 代码示例:

    'Bring IE up and navigate to page
          Set ie = New SHDocVw.InternetExplorerMedium
          ie.Visible = True
          'Set the URL
          strURL = "my url"
          'Navigate to url
          ie.Navigate strURL
          'Wait for the page to show up
    

    button.click?

    我对Vba的excel选项还不熟悉,我确信有更简单的方法,但这是我应该这样做的方式。非常感谢任何帮助或提示。

    网址为https://www.legacy.com/obituaries/commercialappeal/browse

1 个答案:

答案 0 :(得分:1)

所以这比预期更棘手,因为我在获得整个结果集时遇到了问题。最后我选择selenium basic,因为它更好地处理了初始页面加载,我没有得到关于cookie等的重复警告。说实话,可能是因为我使用的是Chrome驱动程序!当然可以更改驱动程序,以便支持另一种浏览器类型。

<强>代码:

Option Explicit

'281 Results on 2018-05-04 '16:00
Public Sub test()

    Dim d As WebDriver
    Set d = New ChromeDriver

    With d
        .Start "Chrome"
        .Get "https://www.legacy.com/obituaries/commercialappeal/browse?view=name"

        Dim elements  As List
        Set elements = d.FindElementsByTag("a").Attribute("href")

        Dim hrefCollection As New Collection, i As Long

        For i = 1 To elements.Count
            If InStr(elements(i), "https://www.legacy.com/obituaries/commercialappeal/obituary.aspx?n=") > 0 Then
                If i = 1 Then
                    hrefCollection.Add elements(i)
                ElseIf i > 1 And elements(i) <> elements(i - 1) Then
                    hrefCollection.Add elements(i)
                End If
            End If
        Next i
    End With

    Dim wrdApp As Object, wrdDoc As Object
    Set wrdApp = CreateObject("Word.Application")
    wrdApp.Visible = True

    Set wrdDoc = wrdApp.Documents.Add

    With wrdApp.ActiveDocument.PageSetup
        .Orientation = 1                         'wdOrientLandscape
        .TopMargin = wrdApp.InchesToPoints(0.98)
        .BottomMargin = wrdApp.InchesToPoints(0.98)
        .LeftMargin = wrdApp.InchesToPoints(0.98)
        .RightMargin = wrdApp.InchesToPoints(0.98)
    End With

    With wrdDoc
        .Styles.Add ("SHeading")
        .Styles.Add ("StdText")

        With .Styles("SHeading").Font
            .Name = "Arial"
            .Size = 14
            .Bold = False
            .Underline = True
        End With
        With .Styles("StdText").Font
            .Name = "Arial"
            .Size = 8
            .Bold = False
            .Underline = False
        End With
    End With

    wrdApp.Selection.Collapse Direction:=0       'wdCollapseEnd

    For i = 1 To 2                               '<== Test example to get two results
        DoEvents
        wrdApp.Selection.TypeParagraph
        wrdApp.Selection.Style = wrdDoc.Styles("SHeading")
        wrdApp.Selection.TypeText Text:=GetInfo(hrefCollection.Item(i), d)
    Next i

    '   For Each Item In hrefCollection  ''<== use this above to get all results
    '       DoEvents
    '       wrdApp.Selection.TypeParagraph
    '       wrdApp.Selection.Style = wrdDoc.Styles("SHeading")
    '       wrdApp.Selection.TypeText Text:=GetInfo(hrefCollection.Item(i), d)
    '   Next Item

    d.Quit
End Sub

Public Function GetInfo(ByVal url As String, ByVal d As WebDriver) As String
    With d
        .Get url
        GetInfo = d.FindElementByClass("ObitTextContent").Text
    End With
End Function

注意:

  1. 我不会显示输出,因为不确定ob告是否需要在此网站上。
  2. 感谢@Kyo我劫持的Word代码。你需要将它整理成你需要的东西。