Extracting a value from a link in PowerShell

Time: 2013-05-19 18:02:56

Tags: powershell csv match href

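I have a function in PowerShell that reads the contents of a file and splits them into fields to go into a CSV file. I'd like to know whether there is a way to pull a value out of the link and add it as an extra column in the CSV file, while leaving the Link column unchanged.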

function Convert2CSV {
(Get-Content $input_path) -match "href" | % {
$data = ($_ -replace '(?:.*)href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>(?:.*)' , '$1;$2;$3;$4').Split(";")
New-Object psobject -Property @{
    "Link" = $data[0]
    "Date" = $data[1]
    "Time" = $data[2]
    "Item" = $data[3]
    }
} #| Export-Csv $output_file -NoTypeInformation
}

The value I'm looking for is

FeedDefault_.*?(&) or _Feed.*?(&)

I think I could add some kind of if statement in the "Link" = $data[0] part?

Sample output, as requested.

Value in Link   |   Link                                                                    |   Date        |   Time    |   Item            |
--------------------------------------------------------------------------------------------------------------------------------------------|
bluepebbles     |   http://www.domain.com/page.html?FeedDefault_bluepebbles&something       |   2013-05-19  |   13:30   | Blue Pebbles      |
--------------------------------------------------------------------------------------------------------------------------------------------|
redpebbles      |   http://www.domain.com/page.html?Feed_redpebbles&something               |   2013-05-19  |   13:31   | Red Pebbles       |
--------------------------------------------------------------------------------------------------------------------------------------------|

Formatted as CSV

Value in Link,Link,Date,Time,Item
"bluepebbles","http://www.domain.com/page.html?FeedDefault_bluepebbles&something","2013-05-19","13:30","Blue Pebbles"
"redpebbles","http://www.domain.com/page.html?Feed_redpebbles&something","2013-05-19","13:31","Red Pebbles"

Running the following:

$input_path = 'f:\mockup\area51\files\link.html'
$output_file = 'f:\mockup\area51\files\db_csv.csv'

$tstampCulture = [Globalization.cultureinfo]::GetCultureInfo("en-GB")

$ie = New-Object -COM "InternetExplorer.Application"
$ie.Visible = $false

$ie.Navigate("file:///$input_path")

$ie.document.getElementsByTagName("a") | % {
  $_.innerText -match 'Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)'
  $obj = New-Object psobject -Property @{
    "Link" = $_.href
    "Date" = $matches[1]
    "Time" = $matches[2]
    "Item" = $matches[3]
  }
  if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
    $obj | Add-Member –Type "NoteProperty" –Name "LinkValue" –Value $matches[1]
  }
  $obj
} #| Export-Csv $output_file -NoTypeInformation

returns the error:

You cannot call a method on a null-valued expression.
At line:12 char:38
+     $ie.document.getElementsByTagName <<<< ("a") | % {
+ CategoryInfo          : InvalidOperation: (getElementsByTagName:String) [], RuntimeException
+ FullyQualifiedErrorId : InvokeMethodOnNull

So I'm pretty sure I've messed something up. :)

1 answer:

Answer 0: (score: 1)

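First, I'd suggest using -match instead of -replace. The resulting $matches collection already contains the submatches you're interested in, so there's no need to build that array by hand.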

Get-Content $input_path | ? { $_.contains("href") } | % {
  # Use -match as the condition so its Boolean result doesn't leak into the pipeline
  if ( $_ -match 'href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>' ) {
    $obj = New-Object psobject -Property @{
      "Link" = $matches[1]
      "Date" = $matches[2]
      "Time" = $matches[3]
      "Item" = $matches[4]
    }
    $obj
  }
} #| Export-Csv $output_file -NoTypeInformation
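
As a quick illustration of what -match leaves behind, here is a minimal sketch run against a single made-up line. The sample line is an assumption modeled on the question's expected output; its date is written with dots because the `[\w\.]+` group in the pattern does not match hyphens.

```powershell
# Hypothetical input line (not taken from the real link.html).
$line = '<a href="http://www.domain.com/page.html?FeedDefault_bluepebbles&something">Date: 19.05.2013 13:30 Item: Blue Pebbles</a>'

# For a single string, -match returns a Boolean and fills $matches on success.
$ok = $line -match 'href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>'

# $matches[0] holds the whole match; the capture groups start at index 1.
$link = $matches[1]
$date = $matches[2]
$time = $matches[3]
$item = $matches[4]

"$ok | $link | $date | $time | $item"
```

Copying the groups into variables right away matters because the next successful -match overwrites $matches.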

The additional information can be extracted from $obj.Link with -match and then added to the custom object via Add-Member:

if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
  $obj | Add-Member -Type "NoteProperty" -Name "LinkValue" -Value $matches[1]
}
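
To see what that pattern captures, the sketch below runs it against the two URLs from the question's sample output. The variable names are illustrative only.

```powershell
# URLs copied from the sample output in the question.
$links = @(
    'http://www.domain.com/page.html?FeedDefault_bluepebbles&something',
    'http://www.domain.com/page.html?Feed_redpebbles&something'
)

# Collect the captured group for every link that matches.
$linkValues = foreach ($link in $links) {
    if ($link -match '\?Feed(?:Default)?_(.*?)&') {
        $matches[1]
    }
}

$linkValues   # bluepebbles, redpebbles
```

The optional non-capturing group `(?:Default)?` lets one pattern cover both the `FeedDefault_` and the `Feed_` form, and the lazy `(.*?)` stops at the first `&`.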

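Also, since your input file appears to be an HTML file, you should consider using the InternetExplorer COM object, which gives you better control over the tags being extracted than processing the file line by line.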

$ie = New-Object -COM "InternetExplorer.Application"
$ie.Visible = $false

$ie.Navigate("file:///$input_path")
while ( $ie.Busy ) { Start-Sleep -Milliseconds 100 }

$ie.document.getElementsByTagName("a") | % {
  # Skip anchors whose text doesn't match, and keep the Boolean out of the output
  if ( $_.innerText -match 'Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)' ) {
    $obj = New-Object psobject -Property @{
      "Link" = $_.href
      "Date" = $matches[1]
      "Time" = $matches[2]
      "Item" = $matches[3]
    }
    if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
      $obj | Add-Member -Type "NoteProperty" -Name "LinkValue" -Value $matches[1]
    }
    $obj
  }
}

# Close the hidden browser instance when done
$ie.Quit()
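
For reference, the line-based pieces can be combined into one sketch that produces the column layout asked for in the question. The input lines are made up (with dot-separated dates, since `[\w\.]+` does not match hyphens), and ConvertTo-Csv is used instead of Export-Csv only so the sketch does not write to disk. Because a hashtable passed to New-Object -Property does not guarantee property order, Select-Object is used to fix the column order.

```powershell
# Hypothetical input lines standing in for Get-Content $input_path.
$lines = @(
    '<a href="http://www.domain.com/page.html?FeedDefault_bluepebbles&something">Date: 19.05.2013 13:30 Item: Blue Pebbles</a>',
    '<a href="http://www.domain.com/page.html?Feed_redpebbles&something">Date: 19.05.2013 13:31 Item: Red Pebbles</a>',
    '<p>no link on this line</p>'
)

$pattern = 'href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>'

$rows = foreach ($line in $lines) {
    if ($line -match $pattern) {
        $obj = New-Object psobject -Property @{
            Link = $matches[1]
            Date = $matches[2]
            Time = $matches[3]
            Item = $matches[4]
        }
        # The second -match overwrites $matches, so it runs after the object is built.
        $linkValue = $null
        if ($obj.Link -match '\?Feed(?:Default)?_(.*?)&') { $linkValue = $matches[1] }
        # Always add the property so every row has the same set of columns.
        $obj | Add-Member -MemberType NoteProperty -Name LinkValue -Value $linkValue
        $obj
    }
}

# Select-Object pins the column order before converting to CSV text.
$csv = $rows | Select-Object LinkValue, Link, Date, Time, Item | ConvertTo-Csv -NoTypeInformation
$csv
```

Swapping ConvertTo-Csv for Export-Csv $output_file -NoTypeInformation would write the same rows to the file.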