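I am using a ContentHandler to parse custom HTML with CSS styles.
The problem: when I parse HTML that contains a UL tag, the ContentHandler reports the events in the wrong order.
It calls the start-tag callback (startElement()), then the end-tag callback (endElement()), and only then characters(), so the text arrives after the element has already been closed.
Here is my HTML: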
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>ul.ul1{list-style-type:image;}
</style>
</head>
<body>
<ul class="ul1">List</ul>
<ul class="ul2">List</ul>
</body>
</html>
Here is the sample code I use to test the parser:
public class ContentHandler implements org.xml.sax.ContentHandler {
public ContentHandler() {
}
public Spanned getResult() {
return null; // placeholder so the class compiles; building the result is omitted here
}
@Override
public void setDocumentLocator(Locator locator) {
}
@Override
public void startDocument() throws SAXException {
}
@Override
public void endDocument() throws SAXException {
}
@Override
public void startPrefixMapping(String prefix, String uri) throws SAXException {
}
@Override
public void endPrefixMapping(String prefix) throws SAXException {
}
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
Log.d("html_parser", "start " + localName);
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
Log.d("html_parser", "end " + localName);
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
String bodyText = new String(ch, start, length);
Log.d("html_parser", bodyText);
}
@Override
public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
}
@Override
public void processingInstruction(String target, String data) throws SAXException {
}
@Override
public void skippedEntity(String name) throws SAXException {
}
}
And the LogCat output:
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ start html
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ start head
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ start meta
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ end meta
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ start style
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ ul.ul1{list-style-type:image;}
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ end style
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ end head
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ start body
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ end body
02-13 13:18:41.555 13211-13211/com.example D/html_parser﹕ end html
Note that HTML without UL tags parses correctly. Also note that I use org.ccil.cowan.tagsoup.jaxp.SAXParserImpl for parsing.
Answer 0 (score: 2)
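I tested your problem and found some interesting facts. You are using a SAX parser to parse HTML, but HTML differs from XML in many ways; for example, tags can sometimes be left unclosed. That is why org.ccil.cowan.tagsoup.jaxp.SAXParserImpl is used: it lets us parse HTML. This parser also inserts some additional tags of its own (see https://github.com/websdotcom/tagsoup#what-tagsoup-does). Look at the HTML in the code below. If you give the list a proper structure (LI items inside the UL), the content is handled correctly. So I think this looks like a bug in the TagSoup library.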
import android.test.AndroidTestCase;
import android.util.Log;
import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
/**
* Created by kulik on 1/5/14.
*/
public class SaxTest extends AndroidTestCase {
private static final String TAG = "SaxTest";
public void testSax() {
String testString = "<!DOCTYPE html>\n" +
"<html>\n" +
"<head>\n" +
"<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
"<style>ul.ul1{list-style-type:image;}\n" +
"</style>\n" +
"</head>\n" +
"<body>\n" +
"<ul class=\"ul1\">List</ul>\n" +
"<ul class=\"ul2\">" +
"<li> li1</li>\n" +
"<li> li2</li>\n" +
"</ul>" +
"</body>\n" +
"</html>";
Reader reader = new StringReader(testString);
try {
SAXParser sp = SAXParserImpl.newInstance(null);
XMLReader xr = sp.getXMLReader();
DefaultHandler myHandler = new ContentHandler();
xr.setContentHandler(myHandler);
xr.parse(new InputSource(reader));
} catch (SAXException e) {
Log.e(TAG, "", e);
} catch (IOException e) {
Log.e(TAG, "", e);
}
}
public class ContentHandler extends DefaultHandler {
@Override
public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
Log.d("html_parser", "start " + localName);
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
Log.d("html_parser", "end " + localName);
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
String bodyText = new String(ch, start, length);
Log.d("html_parser", bodyText);
}
}
}
And the log:
D/html_parser﹕ start html
D/html_parser﹕ start head
D/html_parser﹕ start meta
D/html_parser﹕ end meta
D/html_parser﹕ start style
D/html_parser﹕ ul.ul1{list-style-type:image;}
D/html_parser﹕ end style
D/html_parser﹕ end head
D/html_parser﹕ start body
D/html_parser﹕ start ul
D/html_parser﹕ end ul
D/html_parser﹕ List
D/html_parser﹕ start ul
D/html_parser﹕ start li
D/html_parser﹕ li1
D/html_parser﹕ end li
D/html_parser﹕ start li
D/html_parser﹕ li2
D/html_parser﹕ end li
D/html_parser﹕ end ul
D/html_parser﹕ end body
D/html_parser﹕ end html
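So you can implement your handler to catch this case, because I think it only affects a UL without any LI items inside it.
From the TagSoup page:
The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted .......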
http://home.ccil.org/~cowan/XML/tagsoup/
You can also ask the TagSoup team.
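As a rough illustration of the "catch this case in the handler" suggestion above, here is a minimal sketch. The class name ReorderingHandler and its events list are hypothetical and not part of TagSoup or SAX. It assumes the misordering only ever looks like the logs above: a UL is opened and closed with nothing in between, and its text arrives in the very next characters() call. Under that assumption the handler holds back the "end ul" event and emits it after the text. SAX may split text across several characters() calls, and only the first chunk is re-attached here, so treat this as a starting point rather than a complete fix.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: re-attaches text that is reported right after an
// empty <ul> was closed, so it appears between "start ul" and "end ul".
class ReorderingHandler extends DefaultHandler {
    // Events in corrected order, e.g. "start ul", "text List", "end ul".
    final List<String> events = new ArrayList<>();
    // Name of an element whose end event is being held back, or null.
    private String pendingEnd;
    // True while the most recently opened element has reported no content.
    private boolean lastStartWasEmpty;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushPendingEnd();
        events.add("start " + localName);
        lastStartWasEmpty = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushPendingEnd();
        if (lastStartWasEmpty && "ul".equals(localName)) {
            // Empty <ul>: hold the end event until we see what follows.
            pendingEnd = localName;
        } else {
            events.add("end " + localName);
        }
        lastStartWasEmpty = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // If an end event is pending, this text is emitted before it.
        events.add("text " + new String(ch, start, length));
        lastStartWasEmpty = false;
        flushPendingEnd();
    }

    @Override
    public void endDocument() {
        flushPendingEnd();
    }

    // Emits a held-back end event, if there is one.
    private void flushPendingEnd() {
        if (pendingEnd != null) {
            events.add("end " + pendingEnd);
            pendingEnd = null;
        }
    }
}
```

To try it, pass an instance to xr.setContentHandler(...) in place of the logging handler and inspect events after xr.parse(...) returns. A UL that already contains LI items passes through unchanged, because its end event is never held back.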