从do循环中调用函数,嵌入在VBA中的另一个函数中用于抓取数据

时间:2014-05-27 09:59:54

标签: vba loops excel-vba web-scraping excel

这个问题会很长,我提前道歉。希望它仍然很清楚:我正试图在VBA中建立一个刮刀,从Google专利中提取专利引用。

为此,我有两个Excel工作表。在Sheets(“recent_patents”)中存储了一堆专利号:

column A: 1 2 3 4 5 6 7 8 9 10
column B: 6793915 6797730 6805855 6808712 6812240 6815452 6818218 6818642 6824791 6828324

第二张表格(“rec_out”)包含需要填充提取数据的7列(A到G):

A: Progress B: Input patent C: # fwd cites  D: fwd_cite E: app_date F: grant_date G: Control Patent

棘手的部分(对我来说)是每个专利被引用的次数不同,所以我必须在innerHTMLcitecount中构建一个寻找相同数据(fwd_cite,app_date,grant_date)的循环此循环的迭代次数总是不同。

所以我想把问题分成两个不同的函数extractcitesccount。前者打开相关网页并提取前向引用次数并将其存储在变量Sub CountForwardCites中,而第二个函数则可以循环1到ccount。这两个函数目前都放在Call extractcites

这(我认为)迫使我将 Private Sub CountForwardCites_Click() ' First you need to announce the needed variables Dim ie As Object Dim patent As String Dim ccount As Integer Dim patent_number As String Dim the_HTML_To_Parse As String Dim fwd_cite As String Dim app_date As String Dim grant_date As String Worksheets("recent_patents").Select 'Creating an input row where the data is sourced from input_row = 1 Do DoEvents input_row = input_row + 1 patent_number = Sheets("recent_patents").Range("B" & input_row).Value Call citecount(patent_number, patent, ccount, fwd_cite, app_date, grant_date) ' Step 1: Repeat patent_number ccount times, add control patent, and ccount Worksheets("rec_out").Select sumcount = Sheets("rec_out").Range("B:B").Cells.SpecialCells(xlCellTypeConstants).Count ' this ensures the code does not always start filling up columns from row 2 countrows = sumcount If ccount > 0 Then Do DoEvents countrows = countrows + 1 Sheets("rec_out").Range("A" & countrows) = Sheets("recent_patents").Range("A" & input_row).Value Sheets("rec_out").Range("B" & countrows) = patent_number Sheets("rec_out").Range("C" & countrows) = countrows - sumcount Sheets("rec_out").Range("G" & countrows) = patent 'Call GetForwardCites(the_HTML_To_Parse, fwd_cte, app_date, grant_date) ' Step 2: Get the patent citations Call extractcites(patent_number, the_HTML_To_Parse, fwd_cite, app_date, grant_date) Sheets("rec_out").Range("D" & countrows) = fwd_cite Sheets("rec_out").Range("E" & countrows) = grant_date Sheets("rec_out").Range("F" & countrows) = app_date Loop Until countrows = ccount + sumcount Else countrows = countrows + 1 Sheets("rec_out").Range("B" & countrows) = patent_number Sheets("rec_out").Range("C" & countrows) = ccount Sheets("rec_out").Range("D" & countrows) = ccount Sheets("rec_out").Range("E" & countrows) = "NA" Sheets("rec_out").Range("F" & countrows) = "NA" End If Loop Until input_row = 22500 End Sub 置于循环中,这总是给我一个错误。

以下是当前代码:

 Function extractcites(patent_number As String, the_HTML_To_Parse As String, fwd_cite As String, app_date As String, grant_date As String)

the_HTML_To_Parse = the_HTML_ToParse ' Transfer of the innerHTML string generated in citecount
ForwardCitesScraper.ScrapedCode.Text = the_HTML_To_Parse
fwd_cite = ""
app_date = ""
grant_date = ""

If patent_number = "" Then Exit Function

' Parse HTML up to "Referenced by</SPAN></DIV>" string is (deletes text before).
' I do this because that tag announces the start of the table I'm interested in

If InStr(the_HTML_To_Parse, "Referenced by</SPAN>") > 0 Then

the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "Referenced by</SPAN></DIV>"), Len(the_HTML_To_Parse))

    ' Extract Forward citation Publication Number
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "/patents/") + 9, Len(the_HTML_To_Parse))
    fwd_cite = Mid(the_HTML_To_Parse, 1, InStr(the_HTML_ToParse, "</A></TD>") - 1)
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "-date-value"), Len(the_HTML_To_Parse))
    ' Extract Forward citation application date
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "-date-value") + 13, Len(the_HTML_To_Parse))
    app_date = Mid(the_HTML_To_Parse, 1, InStr(the_HTML_To_Parse, "</TD>") - 1)
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "patent-date-value"), Len(the_HTML_To_Parse))
    ' Extract Forward citation grant date
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "patent-date-value") + 19, Len(the_HTML_To_Parse))
    grant_date = Mid(the_HTML_To_Parse, 1, InStr(the_HTML_To_Parse, "</TD>") - 1)

Else:
    ' Extract Forward citation Publication Number
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "/patents/") + 9, Len(the_HTML_To_Parse))
    fwd_cite = Mid(the_HTML_To_Parse, 1, InStr(the_HTML_ToParse, "</A></TD>") - 1)
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "-date-value"), Len(the_HTML_To_Parse))
    ' Extract Forward citation application date
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "-date-value") + 13, Len(the_HTML_To_Parse))
    app_date = Mid(the_HTML_To_Parse, 1, InStr(the_HTML_To_Parse, "</TD>") - 1)
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "patent-date-value"), Len(the_HTML_To_Parse))
    ' Extract Forward citation grant date
    the_HTML_To_Parse = Mid(the_HTML_To_Parse, InStr(the_HTML_To_Parse, "patent-date-value") + 19, Len(the_HTML_To_Parse))
    grant_date = Mid(the_HTML_To_Parse, 1, InStr(the_HTML_To_Parse, "</TD>") - 1)


End If
End Function

步骤1似乎工作正常,因为它确实返回了所需的值。但是,步骤2总是在“无效的过程调用”中失败。

citecount

为了完整性,我还为Function citecount(patent_number As String, patent As String, ccount As Integer, fwd_cite As String, app_date As String, grant_date As String, the_HTML_ToParse) patent = "" ccount = 0 If patent_number = "" Then Exit Function the_start: ' Open Browser Set ie = CreateObject("InternetExplorer.Application") 'Set ie = New InternetExplorerMedium ie.Top = 0 ie.Left = 0 ie.Width = 800 ie.Height = 600 ie.AddressBar = 0 ie.StatusBar = 0 ie.Toolbar = 0 ie.Visible = False 'If False we won't see the window navigation On Error Resume Next ' Open correct webpage based on query of patent_number ie.Navigate ("http://www.google.com/patents/US" & patent_number & "?") ' Wait for the page to load Do DoEvents If Err.Number <> 0 Then ie.Quit Set ie = Nothing GoTo the_start: End If Loop Until ie.ReadyState = 4 ' Copy the HTML results into an object we can use the_HTML_ToParse = ie.Document.Body.InnerHTML the_HTML_ToParse = Mid(the_HTML_ToParse, InStr(the_HTML_ToParse, "<TABLE class=patent-bibdata>"), Len(the_HTML_ToParse)) ' Extract Patent Publication Number (correctness check) the_HTML_ToParse = Mid(the_HTML_ToParse, InStr(the_HTML_ToParse, "<TD class=patent-bibdata-heading>Publication number</TD>") + 90, Len(the_HTML_ToParse)) pat_n = Mid(the_HTML_ToParse, 1, InStr(the_HTML_ToParse, "</TD></TR>") - 1) patent = pat_n If InStr(the_HTML_ToParse, "Referenced by</A> (") > 0 Then ' Extract ccount ForwardCitesScraper.ScrapedCode.Text = the_HTML_ToParse the_HTML_ToParse = Mid(the_HTML_ToParse, InStr(the_HTML_ToParse, "Referenced by</A> (") + 19, Len(the_HTML_ToParse)) ref_by = Mid(the_HTML_ToParse, 1, InStr(the_HTML_ToParse, "),</SPAN>") - 1) ForwardCitesScraper.ScrapedCode.Text = the_HTML_ToParse ccount = ref_by Else: ccount = 0 MsgBox "no fwd cites" End If ie.Quit Set ie = Nothing End Function

添加了(正常运行的)代码
the_HTML_ToParse = the_HTML_To_Parse ' No string seems to be transferred

好的,这就是全部。 我个人认为问题出在其中

  1. 转让{{1}}
  2. 在另一个函数的do循环中使用函数的语法问题。
  3. 希望无论如何都可以提出任何建议。我不到一个星期前就开始了VBA,请原谅我,如果这个问题真的很愚蠢。

    西蒙

0 个答案:

没有答案