所以我有以下代码从网站上提取HTML片段。
这种方式很有用,在parse.txt
文件中,我可以看到innerHTML
,其中包含我想要的HTML。
然而,HTML文件中包含更多内容,它包含所有页眉和页脚,这不会显示在文本文件的innerHTML
对象中。
我想要做的只是将该对象(内部HTML)保存在HTML文件中。
$ie = New-Object -com InternetExplorer.Application
$ie.silent = $false
$ie.navigate2("www.website.com/job1")
$ie.Visible = $true
while($ie.busy) {start-sleep 1}
# grab the table html
$ie.document.IHTMLDocument3_getElementsByTagName("div") | Where{ $_.className -eq 'job-template__wrapper' } | Out-file "C:\Users\user\Desktop\Parse.txt"
$ie.Document.body.innerHTML | Out-file "C:\Users\user\Desktop\Parse.html"
$ie.quit()
答案 0 :(得分:0)
管理得出来,可能不是最好的方法,但这里是代码:
# Counters
$i = 1
$page = 1
# Main loop : goes until you have x amount of job JobAds
# This isnt 100% accurate it will stop after the foreach loop below finishes,
# so you may end up with more than x but never less
while($i -le 2000)
{
# IE connection
$ie = New-Object -com InternetExplorer.Application
$ie.Visible = $true # false for silent run
$ie.silent = $false # false for silent run
$ie.navigate2("https://www.website.com.au/page?page=$page")
# wait until ie has finished
while($ie.busy) {start-sleep 1}
# Grab the 22 job links from the set seek page
$site = Invoke-WebRequest -Uri http://www.website.com.au/page
$site.Links.Href | Sort-Object | Get-Unique > C:\Users\user\Desktop\links.txt
$links = @(Get-Content C:\Users\user\Desktop\links.txt | Where-Object { $_ -like '*/job/*' })
# loop through each job link
foreach ( $link in $links )
{
# Connect to job site
$ie.navigate2("http://www.website.com.au" + $link)
while($ie.busy) {start-sleep 1}
# Download and copy to HTML
$ie.document.IHTMLDocument3_getElementsByTagName("div") | Where{ $_.className -eq 'job-template__wrapper' }
$ie.Document.body.innerHTML > "C:\Users\user\Desktop\web_scrape\scrape$i.html"
# Store in variable
$content= Get-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html" | Out-String
# Remove header / footer
$start= $content.indexof('</style>') +8
$end= $content.indexof("</span>", $start)
$length =$end - $start
$content.substring($start, $length) | out-file "C:\Users\user\Desktop\web_scrape\scrape$i.html"
# Add html tags for word conversion
'<!DOCTYPE html PUBLIC >' + (Get-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html" -Raw) | Set-Content "C:\Users\user\Desktop\web_scrape\scrape$i.html"
'<HTML>' + (Get-Content "C:\Users\user\web_scrape\scrape$i.html" -Raw) | Set-Content "C:\Users\user\web_scrape\scrape$i.html"
Add-Content -Path "C:\Users\user\Desktop\web_scrape\scrape$i.html" -Value '</HTML>'
# Set file variables
$htmlFile = ('C:\Users\owain.esau\Desktop\web_scrape\scrape' + $i + '.html');
$docFile = ('C:\Users\owain.esau\Desktop\web_scrape\word\scrape' + $i + '.docx');
# Convert html to word
htmlToWord $htmlFile $docFile
$i += 1
}
$page += 1
$ie.quit()
}