Convert lines in HTML to CSV

Time: 2013-02-26 17:04:22

Tags: powershell html-parsing

I have an HTML file that contains links in this format:
<a href="http://www.google.com">Date: 25.02.2013 10:30 Name: Google</a><br>

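I am trying to write a PowerShell script that extracts the link, date, time, and name and writes them out in CSV format (link, date, time, name).

The following gives me the link but not the other information. What am I missing? The regexes work, but it would also help to drop the "Name:" prefix from the name match.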

$input_path = 'C:\temp\myfile.html'
$output_file = 'C:\temp\myfile.csv'
$regex_link = '([a-zA-Z]{4})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)'
$regex_date = '\d{2}\.\d{2}\.\d{4}'
$regex_time = '\d{2}:\d{2}'
$regex_name = 'Name:\s([\w]*)'
$myVar = select-string -Path $input_path -Pattern $regex_link, $regex_date, $regex_time, $regex_name -AllMatches| % { $_.Matches } | % { $_.Value } 
$myVar

1 Answer:

Answer 0: (score: 0)

This is probably not the cleanest solution, but it works in my test:

$input_path = 'C:\temp\myfile.html'
$output_file = 'C:\temp\myfile.csv'

# Keep only the lines that contain a link
(Get-Content $input_path) -match "href" | % {
    # Rewrite the line as "link;date;time;name", then split it into four fields
    $data = ($_ -replace '(?:.*)href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Name:\s*(.*)</a>(?:.*)' , '$1;$2;$3;$4').Split(";")
    New-Object psobject -Property @{
        "Link" = $data[0].Trim()
        "Date" = $data[1].Trim()
        "Time" = $data[2].Trim()
        "Name" = $data[3].Trim()
    }
} | Select-Object Link, Date, Time, Name | Export-Csv $output_file -NoTypeInformation
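
A possible variation, in case a link ever contains a semicolon (which would shift the fields produced by Split(";")): a minimal sketch that reads the four values from named capture groups instead. It assumes PowerShell 3.0 or later for [pscustomobject]. The $lines array is a hypothetical stand-in for Get-Content $input_path, filled with lines from the test file, and the pattern is my own rewrite rather than the one used in the script above. Because "Name:" sits outside the capture group, it is not part of the extracted name.

```powershell
# Hypothetical stand-in for: $lines = Get-Content $input_path
$lines = @(
    'asdsanfkj'
    '<a href="http://www.google.com">Date: 25.02.2013 10:30 Name: Googledas adka kasjiw</a><br>'
    '<a href="http://www.google2.com">Date: 22.22.2222 20:20 Name: Google2asd addasd </a><br>'
)

# One pattern with a named group per CSV column; the literal "Name:" is matched but not captured.
$pattern = 'href="(?<Link>[^"]+)">Date:\s*(?<Date>[\d.]+)\s+(?<Time>[\d:]+)\s+Name:\s*(?<Name>.*?)\s*</a>'

$rows = foreach ($line in $lines) {
    # -match fills the automatic $Matches variable when the line matches
    if ($line -match $pattern) {
        [pscustomobject]@{
            Link = $Matches['Link']
            Date = $Matches['Date']
            Time = $Matches['Time']
            Name = $Matches['Name']
        }
    }
}

# ConvertTo-Csv returns the CSV lines as strings; pipe $rows to
# Export-Csv $output_file -NoTypeInformation to write the file instead.
$csv = $rows | ConvertTo-Csv -NoTypeInformation
$csv
```

Lines without a link are simply skipped, so no separate -match "href" filter is needed in this version.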

Myfile.html:

<html>
<body>
asdsanfkj
djaksl
sadjklas
<a href="http://www.google.com">Date: 25.02.2013 10:30 Name: Googledas adka kasjiw</a><br>
sadsadmdsa
<a href="http://www.google2.com">Date: 22.22.2222 20:20 Name: Google2asd addasd </a><br>
sajl
dasjdsa
asd
</body>
</html>

Myfile.csv:

"Link","Date","Time","Name"
"http://www.google.com","25.02.2013","10:30","Googledas adka kasjiw"
"http://www.google2.com","22.22.2222","20:20","Google2asd addasd"