I am trying to extract the "NAME" and "EMAIL" text from the following HTML file:
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
</head>
<body>
<ol>
<li>
<font class="normal">
<b>NAME</b> <a href="/member/mail_compose.aspx?id=name"><img src="/images/mailbox.gif" border="0" alt="Send Mail" /></a> <a href="/photos/member_viewphoto.aspx?id=name"><img src="/images/icons/member_photos.gif" border="0" alt="View Photos" /></a> <br />
ADDRESS<br />
PHONE<br />
<a href="mailto:email@hotmail.com" class="redlink">EMAIL</a><br />
<br />
</font>
</li>
</body>
</html>
Here is the code I am using:
// Load the xml document
XDocument xDoc = XDocument.Load(@"..\..\Directory.html");
// Parse document
var names = xDoc.Root.DescendantsAndSelf()
    .Where(x => x.Name.LocalName == "ol").DescendantsAndSelf()
    .Where(x => x.Name.LocalName == "li").DescendantsAndSelf()
    .Select(x => new
    {
        name = x.Elements().Where(y => y.Name.LocalName == "b").Select(y => y.Value),
        email = x.DescendantsAndSelf().Where(y => y.Name.LocalName == "a" && x.FirstAttribute.Name == "href" && x.Attribute("href").Value.Contains("mailto")).Select(y => y.Value ?? "No Email")
    }
    );
// Print text to console
for (int i = 0; i < names.Count(); i++)
{
    Console.WriteLine("{0}: {1}", names.ElementAt(i).name, names.ElementAt(i).email);
}
For some reason, the code above prints this:
System.Linq.Enumerable+WhereSelectEnumerableIterator`2[System.Xml.Linq.XElement,System.String]: System.Linq.Enumerable+WhereSelectEnumerableIterator`2[System.Xml.Linq.XElement,System.String]
Can someone tell me why this happens? Also, suggestions for a better approach would be very welcome.
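Answer 0 (score: 1)
This version does not check for null. Note that most of the places where I use FirstOrDefault could throw a NullReferenceException, because the solution never checks the result for null.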
var htmlToProcess =
@"<!DOCTYPE html>
<html lang='en' xmlns='http://www.w3.org/1999/xhtml'>
<head>
<meta charset='utf-8' />
<title></title>
</head>
<body>
<ol>
<li>
<font class='normal'>
<b>NAME</b> <a href='/member/mail_compose.aspx?id=name'><img src='/images/mailbox.gif' border='0' alt='Send Mail' /></a> <a href='/photos/member_viewphoto.aspx?id=name'><img src='/images/icons/member_photos.gif' border='0' alt='View Photos' /></a> <br />
ADDRESS<br />
PHONE<br />
<a href='mailto:email@hotmail.com' class='redlink'>EMAIL</a><br />
<br />
</font>
</li>
</ol>
</body>
</html>";
// The original snippet never defined dataSet1Tree; parsing the string with
// XElement.Parse is an assumption about what was intended.
var dataSet1Tree = XElement.Parse(htmlToProcess);
var body = dataSet1Tree.Nodes()
    .OfType<XElement>()
    .FirstOrDefault(x => x.Name.LocalName.ToLower() == "body");
if (body != null)
{
    var oi = body.Descendants()
        .FirstOrDefault(x => x.Name.LocalName.ToLower() == "ol");
    if (oi != null)
    {
        var lis = oi.Elements()
            .Where(x => x.Name.LocalName.ToLower() == "li");
        // Search inside each <li> (not the whole body) for its <font> block.
        var listContainingInfo = from font in lis.Select(li => li.Descendants()
                                         .FirstOrDefault(x => x.Name.LocalName.ToLower() == "font"))
                                     .Where(f => f != null)
                                 select font.Nodes().OfType<XElement>();
        var listOfUsers = listContainingInfo.Select(nodes => new
        {
            Name = nodes.FirstOrDefault(innerNode => innerNode.Name.LocalName.ToLower() == "b").Value,
            // Email is the href value of the link whose text is "EMAIL".
            Email = nodes.FirstOrDefault(innerNode => innerNode.Value == "EMAIL")
                .Attributes("href")
                .FirstOrDefault()
                .Value
        });
        foreach (var user in listOfUsers)
            Console.WriteLine(user.Name + " " + user.Email);
    }
}
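If the NullReferenceException risk mentioned above is a concern, one option is the null-conditional operator. The following is only a sketch, assuming C# 6 or later; the class name NullSafeLookup, the helper Describe, and the namespace-free sample markup are made up for illustration (with the xmlns from the question you would still match on Name.LocalName).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public class NullSafeLookup
{
    // Hypothetical helper: builds "name email" for one <font> element
    // without dereferencing a null FirstOrDefault result.
    public static string Describe(XElement font)
    {
        // ?. yields null instead of throwing when FirstOrDefault finds nothing.
        string name = font.Elements("b").FirstOrDefault()?.Value ?? "No Name";

        // (string)attribute is null when the attribute is missing.
        string email = font.Elements("a")
            .FirstOrDefault(a => ((string)a.Attribute("href") ?? "")
                .StartsWith("mailto:", StringComparison.OrdinalIgnoreCase))
            ?.Attribute("href")?.Value ?? "No Email";

        return name + " " + email;
    }

    public static void Main()
    {
        // Made-up sample: the second <li> has no mailto link.
        var ol = XElement.Parse(
            @"<ol>
                <li><font><b>NAME</b><a href='mailto:email@hotmail.com'>EMAIL</a></font></li>
                <li><font><b>OTHER</b></font></li>
              </ol>");

        foreach (var font in ol.Descendants("font"))
            Console.WriteLine(Describe(font));
    }
}
```

With this sample, the first entry prints the mailto address and the second falls back to "No Email" instead of throwing.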
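Answer 1 (score: 0)
To answer your first question (which is probably more important to you than the code I had to write to work with that sample HTML): you are calling .Select on your name and email fields, so each one is a collection rather than a string. That is why a collection comes back when you loop over names, and Console.WriteLine prints the type name of that collection instead of its contents. If a collection is actually what you want, use SelectMany rather than Select when you create the anonymous objects.
Without a schema, I don't know of a better way to do the XML traversal before the ".Select".
Another problem is the href check: your lambda tests x.FirstAttribute and x.Attribute("href"), which belong to the outer element, where it should test y. I also compare against FirstAttribute.Name.LocalName rather than just FirstAttribute.Name, which keeps the check consistent with the other LocalName comparisons.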
var names = xDoc.Root.DescendantsAndSelf()
    .Where(x => x.Name.LocalName == "ol").DescendantsAndSelf()
    .Where(x => x.Name.LocalName == "li").DescendantsAndSelf()
    .Where(x => x.Name.LocalName == "font")
    .Select(x => new
    {
        name = x.Descendants().Where(y => y.Name.LocalName == "b").Select(y => y.Value).Single(),
        email = x.Descendants().Where(y => y.Name.LocalName == "a" && y.FirstAttribute.Name.LocalName == "href" && y.Attribute("href").Value.Contains("mailto")).Select(y => y.Value).Single()
    });
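To make the Select point concrete, here is a small sketch of the difference between printing the sequence and printing its single value. The one-element fragment and the class name SelectVersusSingle are made up for illustration, and the exact type name printed by the first WriteLine depends on the runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public class SelectVersusSingle
{
    public static void Main()
    {
        // Made-up fragment standing in for the <font> block.
        var font = XElement.Parse("<font><b>NAME</b></font>");

        // Select returns a lazy IEnumerable<string>, not a string.
        IEnumerable<string> asSequence = font.Elements("b").Select(b => b.Value);

        // Formatting the sequence calls ToString() on the iterator object,
        // which yields its runtime type name rather than its contents.
        Console.WriteLine("{0}", asSequence);

        // Collapse the sequence to one value, or join all values explicitly.
        Console.WriteLine(asSequence.Single());            // prints NAME
        Console.WriteLine(string.Join(", ", asSequence));  // prints NAME
    }
}
```

Single() throws if the sequence is empty or has more than one element, so it only fits when exactly one match is expected.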
A few notes:
y.Value ?? "No Email"
needs to be reworked, because y.Value is never null.
You were also missing the closing </ol> tag in your HTML :)
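One possible way to rework the "No Email" fallback is to apply the default to the sequence rather than to y.Value, for example with DefaultIfEmpty. This is just a sketch under assumptions: the helper EmailOrDefault, the class name EmailFallback, and the two sample fragments are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public class EmailFallback
{
    // Hypothetical helper: returns the text of the first mailto link
    // under the element, or "No Email" when there is none.
    public static string EmailOrDefault(XElement font)
    {
        return font.Descendants()
            .Where(y => y.Name.LocalName == "a"
                     && ((string)y.Attribute("href") ?? "").Contains("mailto"))
            .Select(y => y.Value)
            .DefaultIfEmpty("No Email") // supplies the fallback when nothing matched
            .First();
    }

    public static void Main()
    {
        var withEmail = XElement.Parse("<font><a href='mailto:email@hotmail.com'>EMAIL</a></font>");
        var withoutEmail = XElement.Parse("<font><b>NAME</b></font>");

        Console.WriteLine(EmailOrDefault(withEmail));    // prints EMAIL
        Console.WriteLine(EmailOrDefault(withoutEmail)); // prints No Email
    }
}
```

Unlike Single(), this does not throw when an entry has no mailto link, which seems closer to what the original "No Email" default was aiming for.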