I want to parse the second div out of the following HTML:
<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>
That is, this value: <div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>
The id can contain any digits.
Here is my attempt:
Regex rgx = new Regex(@"'post-body-\d*'");
var res = rgx.Replace("<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>", "");
I expect the result to be <div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>, but that is not what I get.
Answer 0: (score: 1)
If you are 100% sure that the text before and after the number will always be the same, you can use the String class's .IndexOf and .Substring methods to break the string into pieces.
string original = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
// IndexOf returns the position in the string where the piece we are looking for starts
int startIndex = original.IndexOf(@"<div class='post-body entry-content' id='post-body-");
// For the endIndex, add the number of characters in the string that you are looking for
int endIndex = original.IndexOf(@"' itemprop='articleBody'>") + 25;
// this substring will retrieve just the inner part that you are looking for
string newString = original.Substring(startIndex, endIndex - startIndex);
// newString should now equal "<div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'>"
// or, if you want to just remove the inner part, build a different string like this:
// First, get everything leading up to the startIndex
string divString = original.Substring(0, startIndex);
// then, add everything after the endIndex
divString += original.Substring(endIndex);
// divString should now equal "<div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>"
Hope this helps...
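The snippet above assumes both marker strings are always present. If that is not guaranteed, IndexOf returns -1 and Substring would throw or return the wrong slice. Here is a minimal sketch of the same approach with guards added. The class name `DivSlicer` and the method `TrySplit` are hypothetical names introduced for illustration, not part of the answer.

```csharp
using System;

public static class DivSlicer
{
    // Hypothetical helper: returns false instead of throwing when a marker is missing.
    public static bool TrySplit(string original, out string inner, out string remainder)
    {
        const string startMarker = "<div class='post-body entry-content' id='post-body-";
        const string endMarker = "' itemprop='articleBody'>";

        inner = string.Empty;
        remainder = original;

        int startIndex = original.IndexOf(startMarker, StringComparison.Ordinal);
        if (startIndex < 0) return false;

        // Look for the end marker only after the start marker.
        int endMarkerIndex = original.IndexOf(
            endMarker, startIndex + startMarker.Length, StringComparison.Ordinal);
        if (endMarkerIndex < 0) return false;

        // Use the marker's Length rather than a hard-coded 25.
        int endIndex = endMarkerIndex + endMarker.Length;
        inner = original.Substring(startIndex, endIndex - startIndex);
        remainder = original.Substring(0, startIndex) + original.Substring(endIndex);
        return true;
    }

    public static void Main()
    {
        string original = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
        if (TrySplit(original, out string inner, out string remainder))
        {
            Console.WriteLine(inner);     // the inner div tag
            Console.WriteLine(remainder); // the two outer div tags
        }
        else
        {
            Console.WriteLine("Markers not found.");
        }
    }
}
```

Using `endMarker.Length` instead of the literal 25 keeps the offset correct if the marker text ever changes.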
Answer 1: (score: 1)
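The reason you are not getting the expected result is that your regex string only searches for 'post-body-\d*' rather than the rest of the div tag. Also, Regex.Replace actually replaces the text you are searching for instead of returning it, so you end up with everything except the text you searched for.
Try changing the regex string to @"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>" and using Regex.Matches (or Regex.Match if you only care about the first occurrence), then process the Matches.
For example: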
string htmlText = @"<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
Regex rgx = new Regex(@"<div class='post-body entry-content' id='post-body-\d*' itemprop='articleBody'>");
foreach (Match match in rgx.Matches(htmlText))
{
// Process matches
Console.WriteLine(match.ToString());
}
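If the goal is the output stated in the question (the string with the inner div removed), the same full-tag pattern can be passed to Regex.Replace instead. A minimal sketch follows. The class name `DivRemover` and the method `RemovePostBodyDiv` are hypothetical, and `\d+` (at least one digit) is an assumption that is slightly stricter than the `\d*` used above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivRemover
{
    // Matches the whole opening tag, not just the id value.
    // \d+ is an assumption; use \d* to also accept an id with no digits.
    private static readonly Regex PostBodyDiv = new Regex(
        @"<div class='post-body entry-content' id='post-body-\d+' itemprop='articleBody'>");

    // Hypothetical helper: returns the input with every matching tag removed.
    public static string RemovePostBodyDiv(string html)
    {
        return PostBodyDiv.Replace(html, "");
    }

    public static void Main()
    {
        string htmlText = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
        // Prints: <div kubedfiuabefiudsabiubfg><div kubedfiuabefiudsabiubfg>
        Console.WriteLine(RemovePostBodyDiv(htmlText));
    }
}
```

Because the pattern is tied to the exact attribute order and quoting, any variation in the markup will cause it to stop matching.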
Answer 2: (score: 0)
You can parse the HTML fragment as an XML fragment and extract the id attribute directly, e.g.
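// Note: XElement.Parse requires well-formed XML (every attribute has a value, every tag is closed);
// the fragment below is not, so it must be cleaned up before this call will succeed.
var html = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
var data = XElement.Parse(html).Element("div").Attribute("id");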
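XElement.Parse only accepts well-formed XML, so this approach works only once the fragment has closed tags and no valueless attributes. The sketch below uses a cleaned-up version of the fragment, which is an assumption and not the question's original input, and shows that the original fragment is rejected with an XmlException. The names `XmlIdReader` and `ReadInnerDivId` are hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlIdReader
{
    // Hypothetical helper: returns the id of the first child div, or "" if absent.
    public static string ReadInnerDivId(string xml)
    {
        XElement root = XElement.Parse(xml);   // throws XmlException on malformed input
        var inner = root.Element("div");       // first child element named div
        return inner?.Attribute("id")?.Value ?? string.Empty;
    }

    public static void Main()
    {
        // Assumed well-formed variant of the fragment: tags closed, no bare attribute.
        string wellFormed = "<div><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'></div></div>";
        Console.WriteLine(ReadInnerDivId(wellFormed)); // post-body-7494158715135407463

        // The fragment from the question is not well-formed XML.
        string original = "<div kubedfiuabefiudsabiubfg><div class='post-body entry-content' id='post-body-7494158715135407463' itemprop='articleBody'><div kubedfiuabefiudsabiubfg>";
        try
        {
            ReadInnerDivId(original);
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed XML: " + ex.Message);
        }
    }
}
```

For real-world HTML that cannot be guaranteed well-formed, a dedicated HTML parser is usually a safer fit than an XML parser.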