Get only the innerHTML from a website?

Time: 2017-11-23 00:51:43

Tags: powershell

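I have the following code, which pulls an HTML fragment from a website.

This part works well: in the Parse.txt file I can see the innerHTML, and it contains the HTML I want.

The HTML file, however, contains much more. It includes all of the headers and footers, which do not appear in the innerHTML of the object written to the text file.

What I want is to save only that object (its innerHTML) to the HTML file.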

$ie = New-Object -com InternetExplorer.Application
$ie.silent = $false
$ie.navigate2("www.website.com/job1")
$ie.Visible = $true
while($ie.busy) {start-sleep 1}

# grab the table html
$ie.document.IHTMLDocument3_getElementsByTagName("div") | Where{ $_.className -eq 'job-template__wrapper' } | Out-file "C:\Users\user\Desktop\Parse.txt"
$ie.Document.body.innerHTML  | Out-file "C:\Users\user\Desktop\Parse.html"

$ie.quit()
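
To make the goal concrete, here is a minimal sketch of what "save only that object's innerHTML" could look like. The function name `Get-InnerHtmlByClass` and the stand-in objects are assumptions for illustration and are not part of the script above. The sketch also assumes that the live IE elements expose `className` and `innerHTML` in the same way as the stand-ins.

```powershell
# Hypothetical helper (not from the original post): returns the innerHTML of
# every element whose className matches the requested class name.
function Get-InnerHtmlByClass {
    param(
        [Parameter(Mandatory)][object[]]$Elements,
        [Parameter(Mandatory)][string]$ClassName
    )

    $Elements |
        Where-Object { $_.className -eq $ClassName } |
        ForEach-Object { $_.innerHTML }
}

# Stand-in objects so the sketch runs without Internet Explorer.
$divs = @(
    [pscustomobject]@{ className = 'site-header';           innerHTML = '<nav>menu</nav>' }
    [pscustomobject]@{ className = 'job-template__wrapper'; innerHTML = '<h1>Job title</h1>' }
    [pscustomobject]@{ className = 'site-footer';           innerHTML = '<p>footer</p>' }
)

$fragment = Get-InnerHtmlByClass -Elements $divs -ClassName 'job-template__wrapper'

# With the live page, the elements would come from the IE document instead
# (assumption: wrapping the COM collection in @() enumerates its elements):
#   $divs = @($ie.document.IHTMLDocument3_getElementsByTagName("div"))
#   Get-InnerHtmlByClass -Elements $divs -ClassName 'job-template__wrapper' |
#       Out-File "C:\Users\user\Desktop\Parse.html"
```

The point of the sketch is that the filtered element, rather than `$ie.Document.body`, is the source of the text that gets written to the file.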

1 answer:

Answer 0 (score: 0)

I managed to work it out. This is probably not the best way to do it, but here is the code:

# Counters
$i = 1
$page = 1


# Main loop: runs until you have x job ads (x = 2000 here)
# This isn't 100% accurate: the limit is only checked after the foreach loop below finishes,
#   so you may end up with more than x but never fewer
while($i -le 2000) 
{
    # IE connection
    $ie = New-Object -com InternetExplorer.Application
    $ie.Visible = $true # set to $false for a silent run
    $ie.silent = $false # set to $true to suppress dialog boxes
    $ie.navigate2("https://www.website.com.au/page?page=$page")

    # wait until ie has finished
    while($ie.busy) {start-sleep 1} 

    # Grab the 22 job links from the current listing page
    # (the page number is included so that each pass fetches a new page)
    $site = Invoke-WebRequest -Uri "https://www.website.com.au/page?page=$page"
    $site.Links.Href | Sort-Object | Get-Unique > C:\Users\user\Desktop\links.txt
    $links = @(Get-Content C:\Users\user\Desktop\links.txt | Where-Object { $_ -like '*/job/*' })

    # loop through each job link
    foreach ( $link in $links )
    {
        # Connect to job site
        $ie.navigate2("http://www.website.com.au" + $link)
        while($ie.busy) {start-sleep 1}


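        # Download and copy to HTML
        # Note: the filtered div on the next line is only written to the output stream, not saved;
        # the file receives the whole body, which is trimmed further down
        $ie.document.IHTMLDocument3_getElementsByTagName("div") | Where{ $_.className -eq 'job-template__wrapper' } 
        $ie.Document.body.innerHTML  > "C:\Users\user\Desktop\web_scrape\scrape$i.html"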

        # Store in variable
        $content= Get-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html" | Out-String

        # Remove header / footer
        $start= $content.indexof('</style>') +8
        $end= $content.indexof("</span>", $start)
        $length =$end - $start
        $content.substring($start, $length) | out-file "C:\Users\user\Desktop\web_scrape\scrape$i.html"


        # Add html tags for word conversion
        # (<HTML> is prepended first so that the DOCTYPE ends up at the very top of the file)
        '<HTML>' + (Get-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html" -Raw) | Set-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html"
        '<!DOCTYPE html PUBLIC >' + (Get-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html" -Raw) | Set-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html"
        Add-Content  -Path "C:\Users\user\Desktop\web_scrape\scrape$i.html" -Value '</HTML>'

        # Set file variables (same folder the scrape files were written to above)
        $htmlFile = ('C:\Users\user\Desktop\web_scrape\scrape' + $i + '.html');
        $docFile = ('C:\Users\user\Desktop\web_scrape\word\scrape' + $i + '.docx');

        # Convert html to word
        # (htmlToWord is a user-defined function that is not shown in this answer)
        htmlToWord $htmlFile  $docFile

        $i += 1

    }

    $page += 1
    $ie.quit()

}
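
The "Remove header / footer" step above assumes that both markers are always present. If `IndexOf` returns -1 for either one, `Substring` will either throw or return the wrong slice. Below is a minimal sketch of the same `IndexOf`/`Substring` idea wrapped in a function with guards. The name `Get-TextBetween` and the sample string are assumptions for illustration, not part of the original answer.

```powershell
# Hypothetical helper (not part of the original answer): returns the text
# between the first $StartMarker and the next $EndMarker that follows it,
# or $null when either marker is missing.
function Get-TextBetween {
    param(
        [Parameter(Mandatory)][string]$Content,
        [Parameter(Mandatory)][string]$StartMarker,
        [Parameter(Mandatory)][string]$EndMarker
    )

    $markerIndex = $Content.IndexOf($StartMarker)
    if ($markerIndex -lt 0) { return $null }   # start marker not found

    # Skip past the start marker itself (the answer hard-codes +8 for '</style>')
    $start = $markerIndex + $StartMarker.Length
    $end = $Content.IndexOf($EndMarker, $start)
    if ($end -lt 0) { return $null }           # end marker not found

    return $Content.Substring($start, $end - $start)
}

# Made-up sample using the same markers as the answer
$sample = '<style>p{}</style><div>Job text</div></span><footer/>'
$result = Get-TextBetween -Content $sample -StartMarker '</style>' -EndMarker '</span>'
```

In the loop, a `$null` result could be used to skip a page whose layout does not match, instead of writing a broken file.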