Parsing relevant tags out of HTML

Time: 2014-12-25 18:01:02

Tags: regex powershell

I need to extract item-name, item-manufacturer, and item-actual from the outerHTML below, using PowerShell.

<DIV class=row>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A>
    <DIV class=text-small>2 ml</DIV>
    <DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Cipla Limited</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 6</DIV>
    <DIV class=item-price>Rs. 6</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">
    <DIV class=row>
    <DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A>
    <DIV class=text-small>28 Tablets</DIV>
    <DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Intas Laboratories Pvt Ltd</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 5.72</DIV>
    <DIV class=item-price>Rs. 5.72</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">

The output should look like this:

Spasmonil (20mg) - Cipla Limited - Rs. 6
Sprintas (75mg) - Intas Laboratories Pvt Ltd - Rs. 5.72

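I am currently doing this in a very inefficient way: I write four outputs to separate txt files (drugsname, drugsquan, drugspric, drugsmanu) and then combine them by hand. Can somebody help me do this in a more elegant way?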

$regex1 = 'item-name.*?>(.*?)</A>'
$regex2 = 'text-small>(.*?)</DIV>'
$regex3 = '"item-manufacturer visible-xs">(.*?)</DIV>'
$regex4 = 'item-actual>(.*?)</DIV>'

$drugsname = $ie.Document.body.outerHTML -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex1){
      $matches[1]      
    }
  }

$drugsquan = $ie.Document.body.outerHTML  -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex2){
      $matches[1]      
    }
  }

$drugsmanu = $ie.Document.body.outerHTML  -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex3){
      $matches[1]      
    }
  }

$drugspric = $ie.Document.body.outerHTML  -split "`r`n" | 
  ForEach-Object{
    If($_ -match $regex4){
      $matches[1]      
    }
  }

$drugsname > "d:\users\desktop\HKD\($control)drugsname.txt"
$drugsquan > "d:\users\desktop\HKD\($control)drugsquan.txt"
$drugsmanu > "d:\users\desktop\HKD\($control)drugsmanu.txt"
$drugspric > "d:\users\desktop\HKD\($control)drugspric.txt"

1 answer:

Answer 0 (score: 2)

Using a multi-line/single-line regex in a here-string (AKA "jumbo shrimp in a can"):

$data = 
@'
<DIV class=row>
<DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/39467/spasmonil-20mg">Spasmonil (20mg)</A>
    <DIV class=text-small>2 ml</DIV>
    <DIV class="item-manufacturer visible-xs">Cipla Limited</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Cipla Limited</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 6</DIV>
    <DIV class=item-price>Rs. 6</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">
    <DIV class=row>
    <DIV class="col-sm-5 col-xs-8"><A class=item-name href="/details/drugs/40759/sprintas-75mg">Sprintas (75mg)</A>
    <DIV class=text-small>28 Tablets</DIV>
    <DIV class="item-manufacturer visible-xs">Intas Laboratories Pvt Ltd</DIV></DIV>
    <DIV class="col-sm-5 hidden-xs"><SPAN class=item-manufacturer>Intas Laboratories Pvt Ltd</SPAN></DIV>
    <DIV class="col-sm-2 col-xs-4 text-right">
    <DIV class=item-actual>Rs. 5.72</DIV>
    <DIV class=item-price>Rs. 5.72</DIV></DIV></DIV></LI>
    <LI class="list-item item js-drug">
'@

[regex]$regex = 
@'
(?ms).*?<DIV class=row>.*?
.+?item-name href=".+?>(.+?)</A>.*?
.+?text-small>(.+?)</DIV>.*?
.+?item-manufacturer.+?>(.+?)</DIV></DIV>.*?
.+?item-actual>(.+?)</DIV>
'@

$regex.Matches($data) |
    ForEach-Object {
        [PSCustomObject]@{
            Name         = $_.Groups[1].Value
            Quantity     = $_.Groups[2].Value
            Manufacturer = $_.Groups[3].Value
            Price        = $_.Groups[4].Value
        }
    }

Name                       Quantity                   Manufacturer               Price                    
----                       --------                   ------------               -----                    
Spasmonil (20mg)           2 ml                       Cipla Limited              Rs. 6                    
Sprintas (75mg)            28 Tablets                 Intas Laboratories Pvt Ltd Rs. 5.72                 

Now you have a collection of objects that you can sort, filter, format, and export to suit your needs.
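As a rough sketch of that last step, the snippet below formats the objects into the "Name - Manufacturer - Price" lines the question asks for and writes them to a single CSV file instead of four separate text files. The `$drugs` array is hard-coded here to stand in for the output of the regex, and the variable names and the `drugs.csv` path are assumptions for illustration only.

```powershell
# Stand-in for the objects produced by the regex (hard-coded for this sketch).
$drugs = @(
    [PSCustomObject]@{ Name = 'Spasmonil (20mg)'; Quantity = '2 ml';       Manufacturer = 'Cipla Limited';              Price = 'Rs. 6' }
    [PSCustomObject]@{ Name = 'Sprintas (75mg)';  Quantity = '28 Tablets'; Manufacturer = 'Intas Laboratories Pvt Ltd'; Price = 'Rs. 5.72' }
)

# Build one "Name - Manufacturer - Price" line per object with the -f format operator.
$lines = $drugs | ForEach-Object {
    '{0} - {1} - {2}' -f $_.Name, $_.Manufacturer, $_.Price
}
$lines
# Spasmonil (20mg) - Cipla Limited - Rs. 6
# Sprintas (75mg) - Intas Laboratories Pvt Ltd - Rs. 5.72

# Export everything to one CSV file (hypothetical path in the temp directory).
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'drugs.csv'
$drugs | Sort-Object Name | Export-Csv -Path $csvPath -NoTypeInformation
```

Because each row is an object rather than a line of text, the same collection can feed `Where-Object`, `Format-Table`, or `Export-Csv` without parsing the HTML again.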