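Apologies if this is too basic, but I can't find a tutorial or any documentation for the Java version of TagSoup.
Basically, I want to download an HTML page from the internet and convert it to XHTML held in a String. How can I do this with TagSoup?
Thanks!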
Answer 0 (score: 8)
Something like this:
wget -O - example.com/bad.html | java -jar tagsoup.jar
Or, from Java, to parse the HTML:
- Create an instance of org.ccil.cowan.tagsoup.Parser
- Provide your own SAX2 ContentHandler
- Provide an InputSource that refers to the HTML
- And call parse()!
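The steps above can be sketched roughly as follows. This is a minimal sketch, not the answerer's own code: instead of a hand-written ContentHandler it lets a JAXP identity Transformer receive the SAX events and write them into a String, which is one way to end up with XHTML text. The class name SaxToString and the method serialize are hypothetical names chosen for illustration, and main assumes tagsoup.jar is on the classpath (the parser is loaded by class name so the file still compiles without it).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToString {

    /** Parses input with the given SAX2 reader and returns the events serialized as XML text. */
    static String serialize(XMLReader reader, InputSource input) throws TransformerException {
        // An identity Transformer stands in for "your own ContentHandler":
        // it installs its handler on the reader and drives the parse itself.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: tagsoup.jar is on the classpath; loaded reflectively
        // so this sketch has no compile-time dependency on it.
        XMLReader tagSoup = (XMLReader) Class.forName("org.ccil.cowan.tagsoup.Parser")
                .getDeclaredConstructor().newInstance();
        // Deliberately sloppy HTML; an InputSource can also wrap an InputStream or a URL.
        InputSource html = new InputSource(new StringReader("<p>unclosed <b>tags"));
        System.out.println(serialize(tagSoup, html));
    }
}
```

Because serialize accepts any XMLReader, the same method works with an ordinary XML parser on well-formed input; passing the TagSoup parser is what makes it tolerant of real-world HTML.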
Answer 1 (score: 0)
The code below should give you a way to pull down a web page and parse it with TagSoup:
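import java.io.IOException;
import java.io.InputStream;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.StatusLine;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.ParsingException;

// Fetch the page with Apache HttpClient
HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet("http://streak.espn.go.com/en/?date=20120824");
HttpResponse response = client.execute(request);
// Check if server response is valid
StatusLine status = response.getStatusLine();
if (status.getStatusCode() != 200) {
    throw new IOException("Invalid response from server: " + status.toString());
}
// Pull content stream from response
HttpEntity entity = response.getEntity();
InputStream inputStream = entity.getContent();
try
{
    XMLReader parser = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
    // Use the TagSoup parser to build an XOM document from the HTML stream
    Document doc = new Builder(parser).build(inputStream);
    // toXML() returns the serialized markup; toString() only gives a short description
    String xhtml = doc.toXML();
}
catch (IOException | SAXException | ParsingException e)
{ ... }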