I have to extract numbers from a web page using Ant. I have already downloaded the page with an Ant task. My page is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>Index of .......</TITLE>
</HEAD>
<BODY>
<H1>Index of .....</H1>
<PRE><IMG SRC="/icons/blank.gif" ALT=" "> <A HREF="?N=A">Name</A> <A HREF="?M=D">Last modified</A> <A HREF="?S=A">Size</A> <A HREF="?D=A">Description</A>
<HR>
<IMG SRC="/icons/back.gif" ALT="[DIR]"> <A HREF="/projects/i/">Parent Directory</A> 19-Dec-2012 11:39 -
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120114-1731/">20120114-1731/</A> 14-Feb-2012 17:40 -
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120115-1055/">20120115-1055/</A> 15-Feb-2012 11:04 -
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120115-1336/">20120115-1336/</A> 15-Feb-2012 13:44 -
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120115-1656/">20120115-1656/</A> 15-Feb-2012 17:05 -
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120115-2157/">20120115-2157/</A> 15-Feb-2012 22:06 -
</PRE><HR>
<ADDRESS>Apache/1.3.41 Server at romgsa.ibm.com Port 443</ADDRESS>
</BODY></HTML>
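From the line <IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120114-1731/">20120114-1731/</A> I have to extract "20120114-1731".
Answer 0 (score: 0)
The following example embeds a Groovy script in the build. Groovy has a useful Grab annotation that can download Java libraries such as htmlcleaner, which can parse an HTML page into XML.
The bootstrap target downloads the Groovy and Ivy jars into ${user.home}/.ant/lib, which makes the groovy task available to Ant. It only needs to be run once: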
$ ant bootstrap
Running the build produces the following expected output:
$ ant
..
parse:
[groovy] 20120114-1731/
[groovy] 20120115-1055/
[groovy] 20120115-1336/
[groovy] 20120115-1656/
[groovy] 20120115-2157/
The build file (build.xml) is:
<project name="demo" default="parse">

    <target name="bootstrap">
        <mkdir dir="${user.home}/.ant/lib"/>
        <get dest="${user.home}/.ant/lib/groovy-all.jar" src="http://search.maven.org/remotecontent?filepath=org/codehaus/groovy/groovy-all/2.1.1/groovy-all-2.1.1.jar"/>
        <get dest="${user.home}/.ant/lib/ivy.jar" src="http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar"/>
    </target>

    <target name="parse">
        <taskdef name="groovy" classname="org.codehaus.groovy.ant.Groovy"/>

        <groovy>
        import org.htmlcleaner.HtmlCleaner;
        import org.htmlcleaner.SimpleXmlSerializer;

        @Grab(group='net.sourceforge.htmlcleaner', module='htmlcleaner', version='2.2.1')

        // HTML page to parse
        def address = 'file:///path/to/example/page.html'

        // Clean any messy HTML
        def cleaner = new HtmlCleaner()
        def node = cleaner.clean(address.toURL())

        // Convert from HTML to XML
        def serializer = new SimpleXmlSerializer(cleaner.getProperties())
        def xml = serializer.getXmlAsString(node)

        // Parse the XML into a document we can work with
        def page = new XmlSlurper(false,false).parseText(xml)

        // Retrieve the anchor tag values matching a pattern
        def numbers = page.body.pre.a.findAll { it.toString().startsWith("2012") }

        numbers.each {
            println it
        }
        </groovy>
    </target>

</project>
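Note that the script above prints each anchor's text including the trailing slash, while the question asks for "20120114-1731" without it. If the directory names always follow the eight-digit date, dash, four-digit time pattern seen in the listing (an assumption based on the sample page), a plain regular expression over the raw HTML may be enough and avoids the htmlcleaner dependency. The following is only a minimal sketch: the html sample string and the extractNames closure are hypothetical names introduced for illustration.

```groovy
// Hypothetical sample: two lines copied from the directory listing in the question.
def html = '''<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120114-1731/">20120114-1731/</A> 14-Feb-2012 17:40 -
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="20120115-1055/">20120115-1055/</A> 15-Feb-2012 11:04 -'''

// Capture the date-time part of each HREF value, leaving out the trailing slash.
// Assumes every wanted entry looks like HREF="yyyymmdd-hhmm/".
def extractNames = { String text ->
    (text =~ /HREF="(\d{8}-\d{4})\/"/).collect { it[1] }
}

def names = extractNames(html)
names.each { println it }
```

Inside the groovy task the page would more likely be read from the downloaded file, for example with new File('/path/to/example/page.html').text, rather than from an inline string. If HTML is ever pasted directly into build.xml, the script body needs to be wrapped in a CDATA section so that the angle brackets do not break the XML.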