I am trying to use this function to parse some scraped HTML into valid XML.
My test code (the htmlParse function is copied and pasted from Ben Nadel's blog):
<cfscript>
// I take an HTML string and parse it into an XML(XHTML)
// document. This is returned as a standard ColdFusion XML
// document.
function htmlParse( htmlContent, disableNamespaces = true ){
    // Create an instance of the Xalan SAX2DOM class as the
    // recipient of the TagSoup SAX (Simple API for XML) compliant
    // events. TagSoup will parse the HTML and announce events as
    // it encounters various HTML nodes. The SAX2DOM instance will
    // listen for such events and construct a DOM tree in response.
    var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();
    // Create our TagSoup parser.
    var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();
    // Check to see if namespaces are going to be disabled in the
    // parser. If so, then they will not be added to elements.
    if (disableNamespaces){
        // Turn off namespaces - they are lame and nobody likes
        // to perform xmlSearch() methods with them in place.
        tagSoupParser.setFeature(
            tagSoupParser.namespacesFeature,
            javaCast( "boolean", false )
        );
    }
    // Set our DOM builder to be the listener for SAX-based
    // parsing events on our HTML.
    tagSoupParser.setContentHandler( saxDomBuilder );
    // Create our content input. The InputSource encapsulates the
    // means by which the content is read.
    var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
        createObject( "java", "java.io.StringReader" ).init( htmlContent )
    );
    // Parse the HTML. This will trigger events which the SAX2DOM
    // builder will translate into a DOM tree.
    tagSoupParser.parse( inputSource );
    // Now that the HTML has been parsed, we have to get a
    // representation that is similar to the XML document that
    // ColdFusion users are used to having. Let's search for the
    // ROOT document and return it.
    return(
        xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
    );
}
</cfscript>
<cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset parsedData = htmlParse(html) />
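(The HTML arrives in this format from a different function, but for now I am hard-coding the string to track down the problem.)
I get the following error: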
NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist.
The error occurred in myfilePath/myfileName.cfm: line 42
40 : // Parse the HTML. This will trigger events which the SAX2DOM
41 : // builder will translate into a DOM tree.
42 : tagSoupParser.parse( inputSource );
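What is going wrong, and how can I fix it?
Answer 0 (score: 3)
I haven't used TagSoup, but I have used jTidy for years with good results: it takes user-supplied HTML from a variety of sources (including MS Word) and cleans it up so that what comes back is XHTML.
You can try jTidy on the same document, either by placing the jTidy JAR on the classpath or by loading it with JavaLoader. Since you are using CF10, you can use this method to include the JAR.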
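One way CF10 can load a JAR per application is the javaSettings struct in Application.cfc. Whether that is the method linked above is an assumption on my part. A minimal sketch, where the application name and the ./lib folder are hypothetical placeholders for wherever you put the jTidy JAR:

```cfml
// Application.cfc
component {
    // Hypothetical application name; use your own.
    this.name = "jtidyDemo";

    // CF10 per-application Java settings. "./lib" is an assumed
    // folder that would contain the jTidy JAR.
    this.javaSettings = {
        loadPaths = [ "./lib" ],
        loadColdFusionClassPath = true,
        reloadOnChange = false
    };
}
```

With something like this in place, createObject( "java", "org.w3c.tidy.Tidy" ) should resolve without touching the server-wide classpath.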
Then, here is how to call jTidy in cfscript:
jTidy = createObject("java", "org.w3c.tidy.Tidy");
jTidy.setQuiet(false);
jTidy.setIndentContent(true);
jTidy.setSmartIndent(true);
jTidy.setIndentAttributes(true);
jTidy.setWraplen(1024);
jTidy.setXHTML(true);
jTidy.setNumEntities(true);
jTidy.setConvertWindowsChars(true);
jTidy.setFixBackslash(true); // changes \ in urls to /
jTidy.setLogicalEmphasis(true); // uses strong/em instead of b/i
jTidy.setDropEmptyParas(true);
// create the in and out streams for jTidy
// (parseData is the HTML string to clean; it must be defined before this point)
readBuffer = CreateObject("java","java.lang.String").init(parseData).getBytes();
inP = createobject("java","java.io.ByteArrayInputStream").init(readBuffer);
outx = createObject("java", "java.io.ByteArrayOutputStream").init();
// do the parsing
jTidy.parse(inP,outx);
outstr = outx.toString();
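For reuse, the same calls can be wrapped in a function that takes the HTML as an argument rather than reading a parseData variable from the surrounding scope. The following is only a sketch: the name makeValid matches the usage further down, but the argument name, the local variable names, and the reduced set of jTidy options are my own choices here.

```cfml
<cfscript>
// Sketch of a wrapper around the jTidy calls shown above.
// Takes an HTML string and returns the tidied XHTML as a string.
function makeValid( required string htmlContent ){
    var jTidy = createObject( "java", "org.w3c.tidy.Tidy" ).init();
    // Only a subset of the options above; add the rest as needed.
    jTidy.setQuiet( true );
    jTidy.setXHTML( true );
    jTidy.setNumEntities( true );
    // Create the in and out streams for jTidy.
    var readBuffer = createObject( "java", "java.lang.String" ).init( htmlContent ).getBytes();
    var inStream = createObject( "java", "java.io.ByteArrayInputStream" ).init( readBuffer );
    var outStream = createObject( "java", "java.io.ByteArrayOutputStream" ).init();
    // Do the parsing and hand back the cleaned markup.
    jTidy.parse( inStream, outStream );
    return outStream.toString();
}
</cfscript>
```

Keeping every variable local with var means the function does not leak jTidy or stream objects into the page's variables scope.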
This returns valid XHTML, which you can then query with XPath. I wrapped the code above in a makeValid() function and ran it against the HTML:
<cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset out = makeValid(html) />
<cfdump var="#xmlParse(out)#" />
Here is the output: