Find all script elements of type JavaScript using XPath

Time: 2018-02-15 16:12:05

Tags: c# linq xpath html-agility-pack

I am trying to find all <script> elements in a set of HTML documents using html-agility-pack (together with LINQ and XPath). These documents have script elements in the header and Google Analytics in the footer. As a first step, I am trying to locate the header scripts and remove them. Notepad++ shows that I have 719 script elements, but my console application finds only 55 of them.

I need help targeting them correctly so that I can remove them from the documents.

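So far I have tried targeting the 'language' type JavaScript, but I only get a few hits when parsing html/head. My method takes the file names from a list. For now, the method prints the number of scripts collected in the list; this will change to "Scripts.Remove();" once I have the right search string.

Source file (head structure):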

<!doctype html system "html.dtd">
<html>
<head>
<link rel="stylesheet" href="../IRstyle.css" type="text/css">
<title>Non-hierarchic document clustering</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="keywords" content="">
<meta name="VW96.objecttype" content="Document">

<script language="JavaScript" type="text/JavaScript">
//Javascript-code goes here
</script>
</head>
<body>    
<!--Body contents goes here-->

<!-- in footer -->
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-67XXXX-X";
urchinTracker();
</script>
</body>
</html>

If someone has a better way to target the JavaScripts in the head node, it would be much appreciated. Any help is appreciated! :)

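1 answer:

Answer 0: (score: 2)

If you only need the <script> element nodes, use the descendant-or-self abbreviation (//), which matches them at any depth in the document. Sample HTML: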

var html =
@"<!doctype html system 'html.dtd'>
<html>
<head>
<link rel='stylesheet' href='../IRstyle.css' type='text/css'>
<title>Non-hierarchic document clustering</title>
<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
<meta name='keywords' content=''>
<meta name='VW96.objecttype' content='Document'>
<script language='JavaScript' type='text/JavaScript'>
//Javascript-code goes here
</script>
</head>
<body>    
<!--Body contents goes here-->

<!-- in footer -->
<script src='http://www.google-analytics.com/urchin.js' type='text/javascript'>
</script>
<script type='text/javascript'>
_uacct = 'UA-67XXXX-X';
urchinTracker();
</script>
</body>
</html>";

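Parsing sample:

// requires: using HtmlAgilityPack;
var document = new HtmlDocument();
document.LoadHtml(html);
// To target only the <script> elements inside <head>, use instead:
// var scriptTags = document.DocumentNode.SelectNodes("//head/script");
var scriptTags = document.DocumentNode.SelectNodes("//script");

// SelectNodes returns null (not an empty collection) when nothing matches
if (scriptTags != null)
{
    foreach (var script in scriptTags) script.Remove();
}

// OUTPUT stands for the destination file path, defined elsewhere
document.Save(OUTPUT);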

Output:

<!doctype html system 'html.dtd'>
<html>
<head>
<link rel='stylesheet' href='../IRstyle.css' type='text/css'>
<title>Non-hierarchic document clustering</title>
<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
<meta name='keywords' content=''>
<meta name='VW96.objecttype' content='Document'>

</head>
<body>    
<!--Body contents goes here-->

<!-- in footer -->


</body>
</html>
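
The question mentions filtering on the 'language' attribute. XPath string comparisons are case-sensitive, and the sample mixes 'text/JavaScript' with 'text/javascript', so an exact-match predicate on one spelling would skip the other. Whether that explains the gap between 719 and 55 is not something the post confirms. As a rough sketch only, the filter could be done in LINQ with a case-insensitive comparison instead. The class `ScriptRemover` and the helpers `RemoveJavaScript` and `IsJavaScript` are hypothetical names invented for this illustration; `Descendants`, `GetAttributeValue` and `Remove` are Html Agility Pack calls.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ScriptRemover
{
    // Hypothetical helper: removes every <script> element that looks like
    // JavaScript and returns how many elements were removed.
    public static int RemoveJavaScript(HtmlDocument document)
    {
        var scripts = document.DocumentNode
            .Descendants("script")
            .Where(IsJavaScript)
            .ToList(); // materialize first: Descendants is lazy, and we mutate the tree below

        foreach (var script in scripts) script.Remove();
        return scripts.Count;
    }

    // Assumption for this sketch: a script counts as JavaScript when it has
    // neither a type nor a language attribute, or when either attribute
    // contains "javascript" in any letter case.
    private static bool IsJavaScript(HtmlNode script)
    {
        var type = script.GetAttributeValue("type", "");
        var language = script.GetAttributeValue("language", "");
        return (type.Length == 0 && language.Length == 0)
            || type.IndexOf("javascript", StringComparison.OrdinalIgnoreCase) >= 0
            || language.IndexOf("javascript", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(
            "<html><head>" +
            "<script language='JavaScript' type='text/JavaScript'>var a;</script>" +
            "</head><body>" +
            "<script type='text/javascript'>var b;</script>" +
            "<script type='text/template'>kept</script>" +
            "</body></html>");

        Console.WriteLine(RemoveJavaScript(document));                          // prints 2
        Console.WriteLine(document.DocumentNode.Descendants("script").Count()); // prints 1
    }
}
```

To restrict the removal to the header, as the question asks, the same filter could be applied to the nodes returned by SelectNodes("//head/script") after a null check.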