PowerShell regex matching a string except the first

Date: 2014-12-13 10:43:19

Tags: html regex powershell match

I have the following HTML pattern.

href="{{url}}" class="item-name prdctNm">{{name}}</a><div>
href="/drugs/sporanox-100-mg-33294" class="item-name prdctNm">Sporanox (100 Mg)</a>
href="/drugs/sporan-200-mg-34240" class="item-name prdctNm">Sporan (200 Mg)</a>
href="/drugs/spornid-500-mg-25051" class="item-name prdctNm">Spornid (500 Mg)</a>

What I want is to get the product names, such as:

Sporanox (100 Mg), Sporan (200 Mg) and Spornid (500 Mg).

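**Updated solution**: The previous pattern matched almost the whole page. Starting from the first instance of "item-name prdctNm" and running to the last </a>, it matched everything in between. However, I need to match only the text between "item-name prdctNm" and the </a> tag next to it.

Now it works perfectly: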

$regex = [RegEx]'"item-name prdctNm"(.[^{}<>]*)</a>'
$url = 'https://www.xxx.com/search/all?name=sporanox'
$wc = New-Object System.Net.WebClient
$content = $wc.DownloadString($url)
$regex.Matches($content) | ForEach-Object { $_.Groups[1].Value }
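
The over-matching described above (one match running from the first "item-name prdctNm" to the last </a>) is the typical behavior of a greedy quantifier. The earlier pattern is not shown in the question, so the sketch below assumes it used a greedy `.*`; the variable names `$html`, `$greedy` and `$lazy` are made up for this illustration, and the sample markup is copied from the lines above so no download is needed.

```powershell
# Sample markup copied from the question ($html is a name chosen for this sketch).
$html = @'
href="/drugs/sporanox-100-mg-33294" class="item-name prdctNm">Sporanox (100 Mg)</a>
href="/drugs/sporan-200-mg-34240" class="item-name prdctNm">Sporan (200 Mg)</a>
href="/drugs/spornid-500-mg-25051" class="item-name prdctNm">Spornid (500 Mg)</a>
'@

# Assumed earlier attempt: greedy .* with (?s) so that . also matches newlines.
# It grabs as much as possible, then backtracks only as far as the LAST </a>.
$greedy = [regex]::Matches($html, '(?s)"item-name prdctNm">(.*)</a>')

# Lazy .*? grabs as little as possible, so each match ends at the NEAREST </a>.
$lazy = [regex]::Matches($html, '(?s)"item-name prdctNm">(.*?)</a>')

$greedy.Count   # 1 (a single match spanning all three links)
$lazy.Count     # 3 (one match per link)

# Print only the captured product names.
$lazy | ForEach-Object { $_.Groups[1].Value }
```

A negated character class such as `[^<>]*`, as used in the working code, reaches the same result by forbidding the match from crossing a tag boundary at all.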

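1 answer:

Answer 0 (score: 1)

Use the following regex and print group index 1 at the end: Groups[0] contains the whole match, while Groups[1] contains the characters captured by the first group.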

$regex = [RegEx]'"item-name prdctNm">([^}{<>]*)</a>'
$url = 'https://www.xxx.com/search/all?name=sporanox'
$wc = New-Object System.Net.WebClient
$content = $wc.DownloadString($url)
$regex.Matches($content) | ForEach-Object { $_.Groups[1].Value }
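
To make the Groups[0] versus Groups[1] distinction concrete, here is a minimal offline sketch that runs the answer's pattern against one sample line from the question. The names `$sample`, `$m`, `$questionRegex` and `$q` are chosen for this illustration only. It also compares the pattern from the question's update, where the `.` inside the group consumes the closing `>` of the tag.

```powershell
# One sample line copied from the question ($sample is an illustrative name).
$sample = 'href="/drugs/sporan-200-mg-34240" class="item-name prdctNm">Sporan (200 Mg)</a>'

# Pattern from the answer: the > sits outside the capture group.
$regex = [regex]'"item-name prdctNm">([^}{<>]*)</a>'
$m = $regex.Match($sample)

$m.Groups[0].Value   # whole match: "item-name prdctNm">Sporan (200 Mg)</a>
$m.Groups[1].Value   # first capture group: Sporan (200 Mg)

# Pattern from the question's update: the leading . sits inside the group,
# so the captured text starts with the > character.
$questionRegex = [regex]'"item-name prdctNm"(.[^{}<>]*)</a>'
$q = $questionRegex.Match($sample)

$q.Groups[1].Value   # >Sporan (200 Mg)
```

Keeping the `>` outside the parentheses is what lets Groups[1] hold the bare product name without any trimming afterwards.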