Trying to parse and interact with web page content using PowerShell

Time: 2017-07-30 04:01:44

Tags: macos powershell web-scraping powershell-core

This is what I ran in PowerShell:

PS > $source = "http://www.bing.com/search?q=sqrt(2)"
PS > $result = Invoke-WebRequest $source
PS > $resultContainer = $result.ParsedHtml.GetElementById("results_container")

This is the error message I received:

The property 'ParsedHtml' cannot be found on this object. Verify that the property exists.
At line:1 char:1
+ $resultContainer = $result.ParsedHtml.GetElementById("results_contain ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], PropertyNotFoundException
    + FullyQualifiedErrorId : PropertyNotFoundStrict

2 answers:

Answer 0 (score: 4)

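I don't believe you can do this with PowerShell on non-Windows platforms (at least not yet). To parse HTML content, PowerShell relies on MSHTML.DLL and/or other Internet Explorer / Edge components that do not exist outside Windows. Note that GetElementById just proxies to the COM object, and there are no COM objects in your environment.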
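Since the error in the question is a strict-mode PropertyNotFoundException, a script that has to run on both Windows PowerShell and PowerShell Core could test for the property before touching it. The following is only a minimal sketch: Test-HasParsedHtml is a hypothetical helper name, and the two stand-in objects are assumptions that replace a real Invoke-WebRequest response so the snippet runs without a network call.

```powershell
# Hypothetical helper: reports whether a response object exposes ParsedHtml.
# Looking the name up via PSObject.Properties returns $null when it is missing,
# so this does not throw even under Set-StrictMode.
function Test-HasParsedHtml {
    param([Parameter(Mandatory)]$Response)
    $null -ne $Response.PSObject.Properties['ParsedHtml']
}

# Stand-in objects (assumptions) instead of a real web response.
$coreLike    = [pscustomobject]@{ Content = '<html></html>' }
$desktopLike = [pscustomobject]@{ Content = '<html></html>'; ParsedHtml = 'placeholder' }

Test-HasParsedHtml -Response $coreLike      # False
Test-HasParsedHtml -Response $desktopLike   # True
```

In a real script the argument would be the object returned by Invoke-WebRequest, and the False branch would fall back to some other way of reading the markup.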

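You can inspect the RawContent property of the object returned by Invoke-WebRequest and parse that string yourself to find what you need, but parsing HTML with regular expressions is inadvisable, so you'll have to use some other approach.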
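To illustrate what working with the raw string looks like, here is a small sketch that pulls a single, well-known tag out of the markup. Get-HtmlTitle is a hypothetical helper, and the sample HTML string is made up for the example. A regex like this is only reasonable for a narrow case such as one title tag; it is not a substitute for a real HTML parser, for the reasons given above.

```powershell
# Hypothetical helper: extracts the text of the first <title> element from an HTML string.
# This is a narrow string match, not HTML parsing; it will break on unusual markup.
function Get-HtmlTitle {
    param([Parameter(Mandatory)][string]$Html)
    $match = [regex]::Match($Html, '<title[^>]*>(.*?)</title>', 'IgnoreCase, Singleline')
    if ($match.Success) { $match.Groups[1].Value.Trim() } else { $null }
}

# In real use the string would come from the response, for example:
#   $result = Invoke-WebRequest $source
#   $html   = $result.Content     # body only; RawContent also contains the headers
# A made-up sample keeps the sketch runnable offline.
$html = '<html><head><title> sqrt(2) - Bing </title></head><body></body></html>'

Get-HtmlTitle -Html $html   # sqrt(2) - Bing
```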

By the way, I was unable to find an element with the id results_container that you used in your example.

Answer 1 (score: 0)

An approach that works (though it is a bit messy) is to use AngleSharp as a .NET assembly from PowerShell. This is also suggested in a PowerShell GitHub issue.

[string]$html = "<!DOCTYPE html>
<html lang=en>
    <meta charset=utf-8>
    <meta name=viewport content=""initial-scale=1, minimum-scale=1, width=device-width"">
    <title>Error 404 (Not Found)!!1</title>
    <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
    <p><b>404.</b> <ins>That’s an error.</ins>
    <p>The requested URL <code>/error</code> was not found on this server.  <ins>That’s all we know.</ins>";

# Load the AngleSharp assembly: https://stackoverflow.com/questions/39257572/loading-assemblies-from-nuget-packages
# WARNING: probably not portable.
$standardAssemblyFullPath = (Get-ChildItem -Filter *.dll -Recurse (Split-Path (Get-Package AngleSharp).Source)).FullName | Where-Object {$_ -like "*standard*"}
Add-Type -Path $standardAssemblyFullPath

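# Note: this is the pre-0.10 AngleSharp API. From 0.10 on, the class is AngleSharp.Html.Parser.HtmlParser and the method is ParseDocument().
$parser = New-Object AngleSharp.Parser.Html.HtmlParser
$document = $parser.Parse($html)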

# Select every element whose id is "logo" and print its markup.
$elements = $document.All | Where-Object {$_.id -eq "logo"}

Write-Host $elements.OuterHtml