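PowerBI - Web scraping more than a million pages

Time: 2018-11-29 19:22:20

Tags: web-scraping powerbi

I want to extract the agent/broker name, licence number, and expiry date from http://mbsweblist.fsco.gov.on.ca/ShowLicence.aspx?M13000248~

The number after "M" is the licence number. I have a Power Query function that pulls the data for a few licences. How can I pull the data for list = {00000000..99999999}? Is PowerBI unsuitable for this purpose? Is there another way?

Thanks, I appreciate your help.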

(page as number) as table =>
let
Source = Web.Page(Web.Contents("http://mbsweblist.fsco.gov.on.ca/ShowLicence.aspx?M"&Number.ToText(page)&"~")),
Data1 = Source{1}[Data],
#"Changed Type" = Table.TransformColumnTypes(Data1,{{"Column1", type text}, {"Column2", type text}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type", each ([Column1] = "Agent/Broker Name:" or [Column1] = "Expiry Date:" or [Column1] = "Licence #:"))
in
#"Filtered Rows"



let
Source = {18001928,13000248},
#"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "Page"}}),
#"Added Custom" = Table.AddColumn(#"Renamed Columns", "Custom", each GetData([Page])),
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom", "Custom", {"Column1", "Column2"}, {"Custom.Column1", "Custom.Column2"})
in
#"Expanded Custom"

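1 answer:

Answer 0 (score: 1)

First, if you are trying to scrape "more than a million pages", I advise caution: the web server will almost certainly treat the repeated requests as a violation of its terms of service or as some form of attack.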

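However, to answer the question from a technical-capability standpoint: your approach of listing the licence numbers and then passing each one to a function that fetches the web data is almost right. Your execution is not quite right, though.

Step 1: Create a function that extracts the data you need, in the format you need, for a single URL built by passing in the licence number as a parameter. I named this function WebData: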

(LicenceNumber) =>
let
    Source = Web.Page(Web.Contents("http://mbsweblist.fsco.gov.on.ca/ShowLicence.aspx?M" & Number.ToText(LicenceNumber) & "~")),
    WebData = Source{1}[Data],
    #"Extracted Text Before Delimiter" = Table.TransformColumns(WebData, {{"Column1", each Text.BeforeDelimiter(_, ":"), type text}}),
    #"Removed Top Rows" = Table.Skip(#"Extracted Text Before Delimiter",1),
    #"Transposed Table" = Table.Transpose(#"Removed Top Rows"),
    #"Promoted Headers" = Table.PromoteHeaders(#"Transposed Table", [PromoteAllScalars=true])
in
    #"Promoted Headers"
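
Before looping over a list, it can help to check the function against a single page. The following is a minimal sketch, assuming the function above is saved as a query named WebData; the licence number is the one quoted in the question.

```powerquery
let
    // Invoke the function once for a single licence number from the question
    Result = WebData(13000248)
in
    Result
```

If the result is a one-row table with the field names as column headers, the function is ready to be used from a second query.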

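Now create a second query that lists the licence numbers you want to retrieve data for, uses the WebData function to retrieve each page's data, and finally combines the results into a single table: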

let
    Source = {13000246..13000250},
    #"Convert to Table" = Table.FromList(Source,Splitter.SplitByNothing(),{"Licence Number"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Convert to Table",{{"Licence Number", Int64.Type}}),
    #"Get WebData" = Table.AddColumn(#"Changed Type", "WebData", each try WebData([Licence Number]) otherwise #table({},{})),
    #"Combine WebData" = Table.Combine(#"Get WebData"[WebData]),
    #"Changed Types" = Table.TransformColumnTypes(#"Combine WebData",{{"Agent/Broker Name", type text}, {"Licence #", type text}, {"Brokerage Name", type text}, {"Licence Class", type text}, {"Status", type text}, {"Issue Date", type date}, {"Expiry Date", type date}, {"Inactive Date", type date}})
in
    #"Changed Types"

Note that the start and end values on the Source line determine the range of the list used.
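
Given the caution about repeated requests, one possible variation is to name the range bounds and pause before each call. This is only a sketch: FirstLicence, LastLicence, and Delay are hypothetical names and values introduced here, and the one-second pause is an assumption, not a limit published by the site. It uses the standard Function.InvokeAfter function, but the engine may still evaluate rows lazily or in a different order, so the spacing between requests is not guaranteed.

```powerquery
let
    // Hypothetical bounds; replace with the range you actually need
    FirstLicence = 13000246,
    LastLicence = 13000250,
    // Assumed pause before each request, to reduce load on the server
    Delay = #duration(0, 0, 0, 1),
    Source = {FirstLicence..LastLicence},
    #"Convert to Table" = Table.FromList(Source, Splitter.SplitByNothing(), {"Licence Number"}),
    // Function.InvokeAfter waits for Delay, then calls WebData for the current row;
    // pages that fail to load fall back to an empty table
    #"Get WebData" = Table.AddColumn(
        #"Convert to Table",
        "WebData",
        each try Function.InvokeAfter(() => WebData([Licence Number]), Delay) otherwise #table({}, {})
    ),
    #"Combine WebData" = Table.Combine(#"Get WebData"[WebData])
in
    #"Combine WebData"
```

Working through the full range in small batches (changing the two bounds between refreshes) keeps each refresh short and makes it easier to stop if the server starts rejecting requests.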