I am trying to find all <script> elements in a series of HTML documents using html-agility-pack (together with LINQ and XPath).
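These documents have script elements in the head and Google Analytics in the footer. As a first step, I tried to locate the head scripts and remove them. Notepad++ shows that I have 719 script elements, but my console application finds only 55 of them.
I need help locating them correctly so that I can remove them from the documents.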
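So far I have tried targeting the scripts by their 'language' type (JavaScript), but I only get some hits when parsing html/head. My method takes the file names from a list. At the moment it prints the number of scripts collected in a list; this will change to "Scripts.Remove();" once I have the right search string.
Source file (head structure):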
<!doctype html system "html.dtd">
<html>
<head>
<link rel="stylesheet" href="../IRstyle.css" type="text/css">
<title>Non-hierarchic document clustering</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="keywords" content="">
<meta name="VW96.objecttype" content="Document">
<script language="JavaScript" type="text/JavaScript">
//Javascript-code goes here
</script>
</head>
<body>
<!--Body contents goes here-->
<!-- in footer -->
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-67XXXX-X";
urchinTracker();
</script>
</body>
</html>
If anyone has a better way to target the JavaScripts in the head node, I would be very grateful. Any help is appreciated! :)
Answer 0 (score: 2)
If you only need the <script> element nodes, use descendant-or-self (//), which matches them at any depth in the document. Sample HTML:
var html =
@"<!doctype html system 'html.dtd'>
<html>
<head>
<link rel='stylesheet' href='../IRstyle.css' type='text/css'>
<title>Non-hierarchic document clustering</title>
<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
<meta name='keywords' content=''>
<meta name='VW96.objecttype' content='Document'>
<script language='JavaScript' type='text/JavaScript'>
//Javascript-code goes here
</script>
</head>
<body>
<!--Body contents goes here-->
<!-- in footer -->
<script src='http://www.google-analytics.com/urchin.js' type='text/javascript'>
</script>
<script type='text/javascript'>
_uacct = 'UA-67XXXX-X';
urchinTracker();
</script>
</body>
</html>";
Parsing the sample:
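// requires: using HtmlAgilityPack;
var document = new HtmlDocument();
document.LoadHtml(html);
// to target only <script> in <head>, use this line instead:
// var scriptTags = document.DocumentNode.SelectNodes("//head/script");
var scriptTags = document.DocumentNode.SelectNodes("//script");
// by default SelectNodes returns null, not an empty collection, when nothing matches
if (scriptTags != null)
{
    foreach (var script in scriptTags) script.Remove();
}
// OUTPUT is a placeholder for the destination (for example a file path)
document.Save(OUTPUT);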
Output:
<!doctype html system 'html.dtd'>
<html>
<head>
<link rel='stylesheet' href='../IRstyle.css' type='text/css'>
<title>Non-hierarchic document clustering</title>
<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
<meta name='keywords' content=''>
<meta name='VW96.objecttype' content='Document'>
</head>
<body>
<!--Body contents goes here-->
<!-- in footer -->
</body>
</html>
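
Since the question mentions printing the number of collected scripts before switching to removal, the same idea could be wrapped in a small helper that reports how many nodes it removed. This is only a minimal sketch: the class name ScriptStripper, the method Strip, and its default XPath are hypothetical names introduced for illustration, not part of HtmlAgilityPack or of the original post.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ScriptStripper
{
    // Hypothetical helper: removes every node matched by xpath and returns
    // the cleaned markup together with the number of removed nodes.
    public static (string Html, int Removed) Strip(string html, string xpath = "//script")
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // By default SelectNodes returns null when nothing matches.
        var matches = document.DocumentNode.SelectNodes(xpath);
        if (matches == null) return (html, 0);

        // Copy to a list so that removal cannot interfere with enumeration.
        var nodes = matches.ToList();
        foreach (var node in nodes) node.Remove();

        return (document.DocumentNode.OuterHtml, nodes.Count);
    }
}
```

Calling ScriptStripper.Strip(html, "//head/script") would limit removal to the head, while the default "//script" also covers the Google Analytics scripts in the footer. Comparing the returned count against an editor's search count, file by file, may help narrow down which documents account for the 719 versus 55 discrepancy.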