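I have a file with an ".EXCEL" extension (fake Excel, but that is outside our control). It is actually HTML, and I am having trouble converting it to XML.
The HTML looks like this: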
<table class="c41">
<tr class="c5">
<td valign="top" class="c6"><p class="c7"><span class="c8">Cash Activity </span>
</p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">FRIDAY </span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c5">
<td valign="top" class="c6"><p class="c11"><br/></p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">05-JAN-18</span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c12">
<td valign="top" class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p>
</td>
<td valign="top" class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p>
</td>
</tr>
</table>
It renders like this:
Cash Activity | Friday 05-JAN-18
______________________________________________
Prior Day Available Balance | $123,456.58
Is there any way to parse this in PowerShell and output XML like the following?
<?xml version="1.0" encoding="utf-8" ?>
<Cash Activities>
<Cash Activity>
<Activity>Prior Day Available Balance</Activity>
<Balance>123456.58</Balance>
</Cash Activity>
</Cash Activities>
So far, the PowerShell script only pulls the attachment out of the email and saves it as an HTML file:
$account = "my.email@mycompany.com"
#date to append to new file name
$date = Get-Date -Format yyyyMMdd
$searchDate = Get-Date -Format M/dd/yyyy
Write-Host $searchDate
#file to save attachment as
$newFileName = "Balance_Import_$date.xml"
$newFilePath = "C:\MyDirectory\\"
#Go into Outlook and get the MAPI
$mail = New-Object -ComObject outlook.application
$mailNS = $mail.GetNamespace("MAPI")
#get the account and Inbox we want
$myAcount = $mailNS.Folders | ? {$_.Name -eq $account}
$myInbox = $myAcount.Folders | ? {$_.Name -eq "Inbox"};
$myItems = $myAcount.Items | ? {$_.ReceivedTime.Date -eq $searchDate};
#loop through the Inbox and get any Attachments with the extension of .EXCEL
foreach ($f in $myInbox)
{
    foreach($i in $f.Items)
    {
        Write-Host "Checking "$i.Subject"..."
        if($i.ReceivedTime.Date -eq $searchDate)
        {
            Write-Host "---"
            Write-Host $i.Subject
            Write-Host "---"
            foreach($a in $i.Attachments)
            {
                if($a.FileName -like "*.EXCEL")
                {
                    #Move the attachment to the desired directory
                    $a.SaveAsFile((Join-Path $newFilePath $newFileName))
                    Write-Host $a.FileName " Saved as HTML"
                    #TODO: PARSE HTML INTO XML
                }
            }
        }
    }
}
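Answer 0 (score: 2)
There are a couple of potential issues with parsing fake Excel/HTML input: the markup may not be well-formed XML, and it may contain HTML entities that XML does not define.
Assuming your HTML sample above addresses the first issue, you can brute-force the second by decoding the input. After decoding, it only takes some simple XPath to select the nodes needed for the desired XML you specified above (tested and working).
The decoding step looks like this: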
[xml]$html = [System.Net.WebUtility]::HtmlDecode(@'
<table class="c41">
<tr class="c5">
<td valign="top" class="c6"><p class="c7"><span class="c8">Cash Activity </span>
</p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">FRIDAY </span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c5">
<td valign="top" class="c6"><p class="c11"><br/></p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">05-JAN-18</span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c12">
<td valign="top" class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p>
</td>
<td valign="top" class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p>
</td>
</tr>
</table>
'@);
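The XPath step itself is not reproduced in this answer. Here is a minimal sketch of what it could look like, not the answerer's original code. It assumes that data rows are the `tr` elements with class `c12` (true of the sample, but not guaranteed for the real file), and it uses the element names `CashActivities` and `CashActivity` instead of the names in the question, because XML element names cannot contain spaces. The sample table is trimmed so the block runs on its own.

```powershell
# Assumption: a trimmed copy of the sample table; in practice $html comes from the decoded file.
[xml]$html = [System.Net.WebUtility]::HtmlDecode(@'
<table class="c41">
<tr class="c5">
<td class="c6"><p class="c7"><span class="c8">Cash Activity </span></p></td>
<td class="c9"><p class="c10"><br/><span class="c2">FRIDAY </span></p></td>
</tr>
<tr class="c12">
<td class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p></td>
<td class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p></td>
</tr>
</table>
'@)

# Build the output document; element names without spaces are a deliberate change.
$out = New-Object System.Xml.XmlDocument
[void]$out.AppendChild($out.CreateXmlDeclaration('1.0', 'utf-8', $null))
$root = $out.AppendChild($out.CreateElement('CashActivities'))

# Assumption: every row with class "c12" holds one activity/balance pair.
foreach ($row in $html.SelectNodes('//tr[@class="c12"]')) {
    $cells = $row.SelectNodes('td')
    $item = $root.AppendChild($out.CreateElement('CashActivity'))

    $activity = $item.AppendChild($out.CreateElement('Activity'))
    $activity.InnerText = $cells[0].InnerText.Trim()

    # Strip thousands separators so the balance is a plain number.
    $balance = $item.AppendChild($out.CreateElement('Balance'))
    $balance.InnerText = $cells[1].InnerText.Trim() -replace ',', ''
}

$out.OuterXml
```

Building the result with `XmlDocument` rather than string concatenation means any special characters in the activity text are escaped automatically. To write the result to disk, `$out.Save($path)` can be used with a path of your choosing.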
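Answer 1 (score: 0)
Since ConvertTo-Xml exists, your question should really be "How do I convert an HTML table to a PowerShell object?".
For tables whose elements you don't know in advance, you can use a regular expression, even though that is bad practice. https://social.technet.microsoft.com/Forums/scriptcenter/en-US/4fc6d20c-87bd-44e4-8e46-c72d63fe7b41/html-table-to-powershell-object?forum=ITCG
Once the data is in PowerShell objects, it is very easy to rename the headers or clean up the data as needed.
Then use the cmdlet mentioned above to convert to XML.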
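As a rough sketch of the last two steps, assuming the table rows have already been extracted into objects (the `$rows` array below is hypothetical stand-in data, not output from the linked script):

```powershell
# Hypothetical rows, standing in for data already extracted from the HTML table.
$rows = @(
    [pscustomobject]@{ Activity = 'Prior Day Available Balance'; Balance = '6,472,679.45' }
)

# Clean up the data: trim the text and strip thousands separators before converting to a number.
$clean = $rows | ForEach-Object {
    [pscustomobject]@{
        Activity = $_.Activity.Trim()
        Balance  = [decimal]($_.Balance -replace ',', '')
    }
}

# ConvertTo-Xml emits a generic Objects/Object/Property layout, not custom element names.
$xmlText = $clean | ConvertTo-Xml -As String -NoTypeInformation
$xmlText
```

Note that `ConvertTo-Xml` produces its own generic schema (`<Objects>`, `<Object>`, `<Property Name="...">`), so getting the exact element names from the question would still need a reshaping step afterwards.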
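Answer 2 (score: 0)
You can use the Internet Explorer COM object to get the HTML structure, then extract the data you need, and finally generate the XML. Here is an example showing how to do it: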
# Create the IE com object
$oIE = New-Object -COM InternetExplorer.application
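# Load the HTML document
$oie.Navigate("$pwd/html.htm")
# Navigate returns immediately, so wait until the document has finished loading (ReadyState 4 = complete)
while ($oie.Busy -or $oie.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }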
# Function to convert the html COM object graph into PSCustomObjects
# This makes the tree a bit easier to work with since you can access by node name
function Export-Html ($html)
{
    [pscustomobject] @{
        $html.NodeName = @{
            children = $html.Children | foreach { Export-Html $_ }
            text = $html.InnerText
        }
    }
}
# Convert the HTML into a PSCustomObject graph
$obj = Export-Html $oie.Document.Body
# Get the activity string explicitly walking the document hierarchy (could use tree search instead...)
$activity = $obj.BODY.children.
    Table.Children.
    Tbody.children.
    TR[2].Children[0].
    TD.
    Text
# Get the balance, cast to double to get rid of the commas in the HTML text
[double] $balance = $obj.BODY.children.
    Table.Children.
    Tbody.children.
    TR[2].Children[1].
    TD.
    Text
# Finally emit the XML with activity and balance substituted
@"
<?xml version="1.0" encoding="utf-8" ?>
<Cash Activities>
<Cash Activity>
<Activity>$activity</Activity>
<Balance>$balance</Balance>
</Cash Activity>
</Cash Activities>
"@
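One caveat about the template approach: values are substituted as raw text, and XML element names cannot contain spaces, so `<Cash Activities>` will not parse as XML. A minimal sketch of a safer variant follows, assuming renamed elements (`CashActivities`, `CashActivity`) are acceptable. The values below are hypothetical stand-ins for the `$activity` and `$balance` extracted above.

```powershell
# Hypothetical values standing in for $activity and $balance extracted above.
$activity = 'Fees & Charges <pending>'
[decimal]$balance = '6,472,679.45' -replace ',', ''

# Escape XML special characters (&, <, >, quotes) before substituting into the template.
$safeActivity = [System.Security.SecurityElement]::Escape($activity)

$xmlText = @"
<?xml version="1.0" encoding="utf-8" ?>
<CashActivities>
<CashActivity>
<Activity>$safeActivity</Activity>
<Balance>$balance</Balance>
</CashActivity>
</CashActivities>
"@

# Casting to [xml] throws if the text is not well-formed, which makes it a cheap sanity check.
$doc = [xml]$xmlText
$doc.CashActivities.CashActivity.Activity
```

The final cast is only a validation step; `$xmlText` is still the string you would write to the output file.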