PowerShell: Screen-scraping an HTTP page and returning specific lines as variables

Date: 2011-08-05 11:53:40

Tags: powershell web-scraping

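I'm fairly new to PowerShell and have reached the limit of my knowledge. I'm writing a script that scrapes backup data from an internal web page, extracts information from the scrape for processing, and then displays it in Excel.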

$Yesterday = [DateTime]::Now.AddDays(-1)
$datestr = $Yesterday.ToString("dd-MMM-yyyy")
$WebClient = New-Object System.Net.WebClient
$Results = $WebClient.DownloadString("http://fakeurl")

This produces a large amount of output containing HTML markup along with the data I'm interested in, all run together. I then do this:

[StringSplitOptions]$option = "None"
[string[]]$separator = "</td>"
$SPL = $Results.Split($separator, $option)

This splits the data into a more readable format. Here is a snippet of the part of $SPL that I'm interested in:

<tr><td headers="HOST_NAME" class="t13dataalt">server01
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51
<td headers="BACKUP_TYPE" class="t13dataalt">differential
<td headers="SCHEDULE" class="t13dataalt">daily
<td align="right"  headers="SIZE_MB" class="t13dataalt">       2,091.18
<td headers="IMAGES" class="t13dataalt">1
<td headers="EXIT_STATUS" class="t13dataalt">0
</tr><tr><td headers="HOST_NAME" class="t13data">server02
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41
<td headers="BACKUP_TYPE" class="t13data">differential
<td headers="SCHEDULE" class="t13data">daily
<td align="right"  headers="SIZE_MB" class="t13data">       2,496.31
<td headers="IMAGES" class="t13data">1
<td headers="EXIT_STATUS" class="t13data">0

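From this I need to extract the start and end times, calculate the elapsed time, and return the EXIT_STATUS of the most recent backup. I tried the following, but I suspect I may be barking up the wrong tree: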

$Position = select-string -inputobject $SPL -pattern $datestr

$Position.matches gives:

PS C:\Scripts> $Position.matches

Groups   : {03-Aug-2011}
Success  : True
Captures : {03-Aug-2011}
Index    : 12056
Length   : 11
Value    : 03-Aug-2011

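My theory was to use the index plus the length to take a substring and extract the time value that follows the date, but I don't know how to do that. I also think this approach is rather primitive. Surely there is a simpler way to return just the lines I need from that variable, without counting up to that point and then splitting off the rest?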
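For reference, the index-plus-length idea described above can be sketched roughly like this. The $sample string and the fixed $datestr are assumptions made so the snippet runs on its own; they stand in for the scraped text and for yesterday's date.

```powershell
# Hypothetical sample standing in for one chunk of the scraped page
$sample  = '<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23'
$datestr = '01-Aug-2011'   # fixed here so the sketch is repeatable

# Find the date, escaping it so it is treated as literal text
$m = [regex]::Match($sample, [regex]::Escape($datestr))

$time = $null
if ($m.Success) {
    # Index + Length points just past the date; the remainder is the time
    $time = $sample.Substring($m.Index + $m.Length).Trim()
}
$time
```

This only handles the first occurrence of the date in a single string, so it is probably more useful as an illustration than as the final approach.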


OK, since I'm not sure how to add a section like this at the bottom of the page, I'll add it here.

Here is my current script. It runs without any errors but doesn't return any results.

# Get yesterdays date and convert it to the required search format
    $Yesterday = [DateTime]::Now.AddDays(-1)
    $datestr = $Yesterday.ToString("dd-MMM-yyyy")

# Scrape the webpage
    $url = "http://fake-url"
    $WebClient = New-Object System.Net.WebClient
    $Results = $WebClient.DownloadString($url)

# Determine if the previous day is listed in the backups
    $IsDateThere = $Results.Contains($datestr)
        If ($IsDateThere){
            # split the data into rows
            [StringSplitOptions]$option = "None"
            [string[]]$separator = "</td>"
            $SPL = $Results.Split($separator, $option)

            #strip the data into a hash table
            $SPL | 
                Foreach-Object {
                    where {$_ -match 'headers="(.*)" class.*>(.*)'} |
                        ForEach-Object { 
                        @{
                                $matches[1] = ($matches[2]).trim() 
                            }
                        }
                }           
        }
        Else{
            Write-Host "Yesterday's date not found"
        }

Any ideas? I don't know what to do next to get the start time, end time, and exit code of the most recent backup as variables.
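One likely reason the script above prints nothing: a Where-Object placed inside a ForEach-Object script block starts a new pipeline that receives no input, so it never passes anything on. The sketch below illustrates this with a made-up $items array (an assumption, not data from the page) by comparing the nested form with a filter placed directly in the pipeline.

```powershell
# Hypothetical input, just to show the difference between the two forms
$items = 'a1', 'b2', 'a3'

# Nested: Where-Object has no pipeline input of its own, so nothing is emitted
$nested = @($items | ForEach-Object { Where-Object { $_ -match '^a' } })

# Direct: Where-Object receives each item from the pipeline
$direct = @($items | Where-Object { $_ -match '^a' })

"nested: $($nested.Count); direct: $($direct.Count)"
```

The nested form should report a count of 0 and the direct form a count of 2, which suggests moving the filter out of the ForEach-Object block.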

2 answers:

Answer 0 (score: 3)

I would approach it something like this:

$html = @"
<tr><td headers="HOST_NAME" class="t13dataalt">server01
<td headers="AUTOSYS_JOB" class="t13dataalt">nbu.os.wn.135b.server01
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51
<td headers="BACKUP_TYPE" class="t13dataalt">differential
<td headers="SCHEDULE" class="t13dataalt">daily
<td align="right"  headers="SIZE_MB" class="t13dataalt">       2,091.18
<td headers="IMAGES" class="t13dataalt">1
<td headers="EXIT_STATUS" class="t13dataalt">0
</tr><tr><td headers="HOST_NAME" class="t13data">server02
<td headers="AUTOSYS_JOB" class="t13data">nbu.os.wn.135b.server02
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41
<td headers="BACKUP_TYPE" class="t13data">differential
<td headers="SCHEDULE" class="t13data">daily
<td align="right"  headers="SIZE_MB" class="t13data">       2,496.31
<td headers="IMAGES" class="t13data">1
<td headers="EXIT_STATUS" class="t13data">0
"@

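# Split on LF or CRLF so the here-string works regardless of line endings
$html -split "\r?\n" | Where-Object { $_ -match 'start_time|end_time' } |
    ForEach-Object {
        # Locate the text between the quotes of headers="..."
        $pos = $_.IndexOf("headers")
        $begin = $pos + 9
        $end = $_.IndexOf('"', $begin)

        New-Object PSObject -Property @{
            Key   = $_.Substring($begin, $end - $begin)
            # Everything after the first ">" is the date/time text
            Value = Get-Date ($_.Substring($_.IndexOf(">") + 1))
        }
    }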

Result:

Key        Value               
---        -----               
START_TIME 8/1/2011 9:23:00 PM 
END_TIME   8/1/2011 9:51:00 PM 
START_TIME 7/31/2011 9:22:00 PM
END_TIME   7/31/2011 9:41:00 PM
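The question also asks for the elapsed time. A minimal sketch, assuming the timestamps always follow the dd-MMM-yyyy HH:mm pattern seen in the sample rows; the two literal strings are copied from the first row, and the variable names are only illustrative.

```powershell
# Parse with an explicit format and culture so the result does not depend on regional settings
$fmt = 'dd-MMM-yyyy HH:mm'
$inv = [System.Globalization.CultureInfo]::InvariantCulture

$start = [datetime]::ParseExact('01-Aug-2011 21:23', $fmt, $inv)
$end   = [datetime]::ParseExact('01-Aug-2011 21:51', $fmt, $inv)

# Subtracting two DateTime values yields a TimeSpan
$elapsed = $end - $start
$elapsed.TotalMinutes   # 28
```

Parsing with ParseExact instead of Get-Date is a design choice here: it avoids surprises on machines whose culture does not use English month abbreviations.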

Answer 1 (score: 1)

This isn't an original answer, just an alternative version of Doug's that uses a regex to capture all of the data:

$html -split "`n" | where {$_ -match 'headers="(.*)" class.*>(.*)'} |
    % { 
        @{
                $matches[1] = ($matches[2]).trim() 
            }
    }
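The pipeline above emits one single-entry hash table per cell. To get at the most recent backup, the cells probably need to be grouped into rows first. Here is a rough sketch of that, assuming HOST_NAME is always the first cell of a row and that timestamps follow dd-MMM-yyyy HH:mm. The trimmed-down $html sample, the slightly tighter regex, and the names $rows, $latest, $startTime, $endTime and $exitStatus are illustrative, not part of the original script.

```powershell
# Hypothetical sample: two rows, cut down to the fields used here
$html = @'
<tr><td headers="HOST_NAME" class="t13dataalt">server01
<td headers="START_TIME" class="t13dataalt">01-Aug-2011 21:23
<td headers="END_TIME" class="t13dataalt">01-Aug-2011 21:51
<td headers="EXIT_STATUS" class="t13dataalt">0
</tr><tr><td headers="HOST_NAME" class="t13data">server02
<td headers="START_TIME" class="t13data">31-Jul-2011 21:22
<td headers="END_TIME" class="t13data">31-Jul-2011 21:41
<td headers="EXIT_STATUS" class="t13data">0
'@

$rows = New-Object System.Collections.ArrayList
$row  = $null
foreach ($line in ($html -split "\r?\n")) {
    if ($line -match 'headers="([^"]*)" class[^>]*>(.*)') {
        if ($matches[1] -eq 'HOST_NAME') {
            # Assumption: HOST_NAME marks the start of a new table row
            $row = @{}
            [void]$rows.Add($row)
        }
        if ($null -ne $row) {
            $row[$matches[1]] = $matches[2].Trim()
        }
    }
}

$fmt = 'dd-MMM-yyyy HH:mm'
$inv = [System.Globalization.CultureInfo]::InvariantCulture

# Most recent backup = row with the latest START_TIME
$latest = $rows |
    Sort-Object { [datetime]::ParseExact($_['START_TIME'], $fmt, $inv) } -Descending |
    Select-Object -First 1

$startTime  = [datetime]::ParseExact($latest['START_TIME'], $fmt, $inv)
$endTime    = [datetime]::ParseExact($latest['END_TIME'], $fmt, $inv)
$exitStatus = [int]$latest['EXIT_STATUS']
```

With the sample above, $latest should be the server01 row, and $startTime, $endTime and $exitStatus are then ordinary variables that can be passed on to Excel or anything else.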

Edit: using the code from the question:

$Yesterday = [DateTime]::Now.AddDays(-1)
$datestr = $Yesterday.ToString("dd-MMM-yyyy")
$WebClient = New-Object System.Net.WebClient
$Results = $WebClient.DownloadString("http://fakeurl")

[StringSplitOptions]$option = "None"
[string[]]$separator = "</td>"
$SPL = $Results.Split($separator, $option)

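# Filter directly in the pipeline; a Where-Object nested inside ForEach-Object gets no input
$SPL |
    Where-Object { $_ -match 'headers="(.*)" class.*>(.*)' } |
    ForEach-Object {
        # Emit one single-entry hash table per cell: header name -> trimmed value
        @{
            $matches[1] = ($matches[2]).Trim()
        }
    }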

Edit 2:

    $SPL |
        Where-Object { $_ -match 'headers="(.*)" class.*>(.*)' } |
        ForEach-Object {
            # Cell values also carry a time (e.g. "01-Aug-2011 21:23"), so compare by prefix
            if (($matches[2]).Trim() -like "$datestr*") { "$($matches[1]) is yesterday's back up" }
        }