Question

I am trying to extract a URL from a text file using PowerShell. The last part of the URL is different every time. A snippet of the file is below:

<table class="button" style="border-collapse: collapse; border-spacing: 0; overflow: 
hidden; padding: 0; text-align: left; vertical-align: top; width: 100%;"><tbody>
<tr style="padding: 0; text-align: left; vertical-align: top;"><td style="-moz-hyphens: none; 
-webkit-hyphens: none; -webkit-text-size-adjust: none; background: #049FD9; 
border: none; border-collapse: collapse !important; border-radius: 2px; color: #fff; display: block; font-family: 'Helvetica-Light','Arial',sans-serif; font-size: 14px; font-weight: lighter; hyphens: none; line-height:19px; margin: 0; padding: 8px 16px; text-align: center; vertical-align: top; width: auto 
!important; word-break: keep-all;">
<a href="https://www.website.com:443/idb/setPassword?t=BcHJEoIgAADQD%2BKQjqZ4VEKtBHLJJm82uWDuxCR%2Bfe%2B58Rl9HRz6QddWkO5MLDXuF6e9m%2Bo0z%2FCVS%2B9IenAp5m5yTfYRa%2BAn4jdWHHF7HTyqRZiRRiNDEE%2BK7ZJywLKeNCTj4ewu4QNu02qXB0ZTXTyxXADwaLeluZGVPCxGXunpVcHbiCVAWRR7ykqGensLVBsqNUpl%2FQE%3D" 
style="-webkit-text-size-adjust: none; font-weight: 100; color: #fff; font-family: 'Helvetica-Light','Arial',sans-serif; font-size: 20px; font-weight: lighter; line-height: 32px; text-decoration: none;">Get Started</a> </td></tr></tbody></table></td>

I want to extract the URL that starts with:

https://www.website.com:443/idb/setPassword

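The string after t= is different every time. How can I extract the whole URL into a variable, which I can then parse to get the information I need, namely the string after ?t=?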

Answer 1

Here is a solution that uses a combination of Select-String and a regular expression to get the URL, and the [System.Uri] class to interrogate it.

# Read the whole file as a single string so the regex sees all of it at once
$Text = Get-Content -Raw 'html-sample.txt'
$URLString = (Select-String -Pattern '(http[s]?)(:\/\/)([^\s,]+)(?=")' -InputObject $Text).Matches.Value

# At this point $URLString is a string with just the URL and query string, as requested
$URLString

# Here's how you might interrogate it
[System.Uri]$URL = $URLString
$Token = ($URL.Query -split '=')[1]
$URL.Host
$Token

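Explanation

Use Select-String with the regular expression (http[s]?)(:\/\/)([^\s,]+)(?=") to extract the URL. Note that by default this returns only the first match in each input string; if you need to match multiple URLs, use the -AllMatches switch of Select-String, and then process each result in a ForEach loop.
Use [System.Uri] to convert the URL string into a URI object.
Access the object's Host property to return the host name of the URL (www.website.com here).
Access the object's Query property to return the query string (?t=... here), then split it on = and take the second element to get the token. An alternative is to strip the leading ?t= with -replace and a regular expression anchored to the start of the string (the ^ marker), using a backslash to escape the ? special character.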
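
To illustrate the -AllMatches note above, here is a minimal sketch that pulls every URL out of a string and extracts each token. The sample text, the token values, and the variable names ($SampleText, $Found, $Tokens) are made up for illustration and are not from the original file.

```powershell
# Hypothetical sample text with two links; the tokens are invented for this example.
$SampleText = @'
<a href="https://www.website.com:443/idb/setPassword?t=abc%2B123" style="color: #fff;">First</a>
<a href="https://www.website.com:443/idb/setPassword?t=def%3D456" style="color: #fff;">Second</a>
'@

# -AllMatches returns every match in the input string, not only the first one.
$Found = Select-String -InputObject $SampleText -Pattern '(http[s]?)(:\/\/)([^\s,]+)(?=")' -AllMatches

$Tokens = foreach ($Match in $Found.Matches) {
    [System.Uri]$Url = $Match.Value
    # Split on the first '=' only, so a literal '=' inside a token would be preserved.
    ($Url.Query -split '=', 2)[1]
}

$Tokens
```

Limiting the split to two parts is a defensive choice: the tokens in the question are URL-encoded, so they should not contain a literal =, but the limit costs nothing if one ever does.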

Answer 2

Try the following:

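# -Raw reads the file as one string instead of an array of lines
$content = Get-Content -Raw -Path 'C:\test.txt'
# [^"]+ stops at the closing quote of the href, so the match cannot run past it
[regex]$regex = '(?<=href="https:\/\/www\.website\.com:443\/idb\/setPassword\?t=)[^"]+(?=")'
$regex.Matches($content).Value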

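In $content, replace the path with that of the text file containing the URL, and update $regex with the correct URL.

This method uses a regex with a lookbehind, (?<=...), to match the site URL before the token and a lookahead, (?=...), to match what follows it, and then selects the text in between.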
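
As a rough sketch of how the lookbehind and lookahead behave, the example below runs the same idea against an inline string. The sample line and token are invented, and building the prefix with [regex]::Escape is a substitution of mine for the hand-escaped pattern, not part of the original answer.

```powershell
# Hypothetical one-line sample; the token value is invented for this example.
$Sample = '<a href="https://www.website.com:443/idb/setPassword?t=abc%2B123" style="color: #fff;">Get Started</a>'

# [regex]::Escape backslash-escapes regex metacharacters such as '.' and '?' in the fixed prefix.
$Prefix = [regex]::Escape('href="https://www.website.com:443/idb/setPassword?t=')

# (?<=...) requires the prefix to precede the match without including it;
# (?=") requires a closing quote to follow. Only [^"]+ ends up in the match.
$Pattern = '(?<=' + $Prefix + ')[^"]+(?=")'

$Token = [regex]::Match($Sample, $Pattern).Value
$Token   # abc%2B123
```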

Answer 3

Here is another way: cast the file contents to [xml] so they are read as an XmlDocument. This requires the file to be well-formed XML.

$thisxml = [xml](gc .\hypertext.html)

Then use XPath to drill down to the node you need:

$thisxpath = ($thisxml).SelectNodes("//table//tr//td//a").href

Then cast to [System.Uri] to parse it and select the parts of the URI you want.

$thisuri = [System.Uri]$thisxpath | %{($_.Scheme + "://" + $_.host + $_.LocalPath)}
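
The steps above can be tried end to end on an inline, well-formed fragment. This is only a sketch under stated assumptions: the markup, the token, and the names $Html, $Doc, $Href, $BaseUrl and $Decoded are hypothetical, and real HTML is often not valid XML, in which case the [xml] cast throws.

```powershell
# Hypothetical well-formed fragment; the token value is invented for this example.
$Html = @'
<table class="button"><tbody><tr><td>
<a href="https://www.website.com:443/idb/setPassword?t=abc%2B123" style="color: #fff;">Get Started</a>
</td></tr></tbody></table>
'@

# The cast fails with an exception if the text is not well-formed XML.
$Doc = [xml]$Html

# Select the anchor node via XPath and read its href attribute.
$Href = $Doc.SelectNodes('//table//tr//td//a') | ForEach-Object { $_.GetAttribute('href') }

[System.Uri]$Uri = $Href
$Uri.Scheme      # https
$Uri.Host        # www.website.com
$Uri.Port        # 443
$Uri.LocalPath   # /idb/setPassword
$Uri.Query       # ?t=abc%2B123

# Rebuild the URL without the query string, as in the pipeline above.
$BaseUrl = $Uri.Scheme + '://' + $Uri.Host + $Uri.LocalPath

# Take the token from the query string and decode it in case the unescaped value is needed.
$Decoded = [System.Uri]::UnescapeDataString(($Uri.Query -split '=', 2)[1])
$Decoded         # abc+123
```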
