Question

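I am using the following code to retrieve the shipping cost from amazon.com by scanning the HTML source of a product page, but the output is not what I want. Here is the code: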

string regexString = "<span class=\"plusShippingText\">(.*)</span>";
Match match = Regex.Match(htmlSource, regexString);
string shipCost = match.Groups[1].Value;
MessageBox.Show(shipCost);

It shows a message box containing the returned shipping cost:

&nbsp;+&nbsp;Free Shipping</span>

But what I actually need is just the clean text:

Free Shipping

Please help me solve this problem.

Answer 1

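You just need to clean up the HTML entities. You can use the following call, which decodes them and strips the "+" and the surrounding whitespace (the trailing </span> still has to be excluded by the regex itself):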

shipCost = System.Net.WebUtility.HtmlDecode(shipCost).Replace("+", "").Trim();
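
To see why decoding followed by trimming works, here is a minimal sketch that runs the same calls on the captured fragment. The names ShippingTextCleaner and Clean are hypothetical, introduced only for illustration, and the sample input assumes the trailing </span> has already been excluded from the match.

```csharp
using System;
using System.Net;

public static class ShippingTextCleaner
{
    // Decodes HTML entities, then drops the '+' separator and surrounding whitespace.
    public static string Clean(string fragment)
    {
        // "&nbsp;" decodes to U+00A0 (no-break space), which Trim() treats as whitespace.
        string decoded = WebUtility.HtmlDecode(fragment);
        return decoded.Replace("+", "").Trim();
    }
}
```

Calling ShippingTextCleaner.Clean("&nbsp;+&nbsp;Free Shipping") should return "Free Shipping". Note that Replace removes every "+" in the string, so this approach would also alter text that legitimately contains a plus sign.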

Answer 2

You can try the following code (although parsing HTML with regular expressions is not a good idea):

string shipCostHtml = Regex.Match(htmlSource, "(?<=<span class=\"plusShippingText\">).*?(?=</span>)").Value;
string shipCost = System.Net.WebUtility.HtmlDecode(shipCostHtml);
shipCost = shipCost.Trim(' ', '+', '\xa0');

Your regex was almost fine; you just need to replace the greedy (.*) with the lazy (.*?).
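
The difference between the greedy and the lazy quantifier can be sketched as follows. The class GreedyVsLazy, its members, and the sample markup are assumptions made up for this illustration; real Amazon pages will look different.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical single-line sample with a second </span> after the one we want.
    public const string Sample =
        @"<span class=""plusShippingText"">&nbsp;+&nbsp;Free Shipping</span> <span>other</span>";

    // Greedy: (.*) extends to the LAST </span> on the line.
    public const string GreedyPattern = @"<span class=""plusShippingText"">(.*)</span>";

    // Lazy: (.*?) stops at the FIRST </span>.
    public const string LazyPattern = @"<span class=""plusShippingText"">(.*?)</span>";

    // Returns the text captured by group 1 of the given pattern.
    public static string Capture(string html, string pattern)
    {
        return Regex.Match(html, pattern).Groups[1].Value;
    }
}
```

With this sample, the greedy pattern captures everything up to the last closing tag, including the first </span> and the text "other", while the lazy pattern captures only "&nbsp;+&nbsp;Free Shipping".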

Here is how to solve the problem with HtmlAgilityPack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlSource);
string shipCostHtml = doc.DocumentNode.SelectSingleNode("//span[@class='plusShippingText']").InnerText;
string shipCost = System.Net.WebUtility.HtmlDecode(shipCostHtml);
shipCost = shipCost.Trim(' ', '+', '\xa0');

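Now you are protected when Amazon decides to add other attributes to the <span>, for example <span style='{color:blue}' class='plusShippingText'>. Note that @class='plusShippingText' compares the whole attribute value, so an extra class such as <span class='plusShippingText newClass'> would only match with a test like contains(@class, 'plusShippingText').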
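
The difference between exact and token-based class matching in XPath can be checked with the standard library alone. This is a sketch under stated assumptions: it uses System.Xml on a small well-formed fragment instead of HtmlAgilityPack (real HTML is usually not well-formed XML), and the names ClassTokenXPath, ExactXPath, TokenXPath and FindText are hypothetical. The same XPath 1.0 expressions are expected to work with HtmlAgilityPack's SelectSingleNode, but that is not exercised here.

```csharp
using System;
using System.Xml;

public static class ClassTokenXPath
{
    // Exact comparison: matches only when the class attribute is exactly "plusShippingText".
    public const string ExactXPath = "//span[@class='plusShippingText']";

    // Token comparison: matches when "plusShippingText" is one of the space-separated classes.
    public const string TokenXPath =
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' plusShippingText ')]";

    // Returns the text of the first matching node, or null when nothing matches.
    public static string FindText(string xml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}
```

For the fragment <div><span class='plusShippingText newClass'>Free Shipping</span></div>, FindText with ExactXPath finds nothing, while TokenXPath returns "Free Shipping". The concat/normalize-space form is a little longer than a plain contains(@class, ...), but it avoids matching unrelated class names that merely contain the same substring.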

C# regex output string is not what I expect
