我有很多需要用XML传播的HTML文件。我们使用这些HTML为应用程序提供内容,但现在我们必须将这些内容作为XML提供。
HTML文件包含,表格,div,图像,p,b或强标签等。
我用Google搜索并找到了一些应用程序,但我还是无法实现。
您能否建议一种将这些文件内容转换为XML的方法?
答案 0 :(得分:2)
我确实找到了将(甚至是坏的)html转换为格式良好的XML的方法。我开始基于DOM loadHTML函数。然而,在几个问题发生时,我优化并添加补丁以纠正副作用。
function tryToXml($dom,$content) {
if(!$content) return false;
// xml well formed content can be loaded as xml node tree
$fragment = $dom->createDocumentFragment();
// wonderfull appendXML to add an XML string directly into the node tree!
// aappendxml will fail on a xml declaration so manually skip this when occurred
if( substr( $content,0, 5) == '<?xml' ) {
$content = substr($content,strpos($content,'>')+1);
if( strpos($content,'<') ) {
$content = substr($content,strpos($content,'<'));
}
}
// if appendXML is not working then use below htmlToXml() for nasty html correction
if(!@$fragment->appendXML( $content )) {
return $this->htmlToXml($dom,$content);
}
return $fragment;
}
// convert content into xml
// dom is only needed to prepare the xml which will be returned
function htmlToXml($dom, $content, $needEncoding=false, $bodyOnly=true) {
// no xml when html is empty
if(!$content) return false;
// real content and possibly it needs encoding
if( $needEncoding ) {
// no need to convert character encoding as loadHTML will respect the content-type (only)
$content = '<meta http-equiv="Content-Type" content="text/html;charset='.$this->encoding.'">' . $content;
}
// return a dom from the content
$domInject = new DOMDocument("1.0", "UTF-8");
$domInject->preserveWhiteSpace = false;
$domInject->formatOutput = true;
// html type
try {
@$domInject->loadHTML( $content );
} catch(Exception $e){
// do nothing and continue as it's normal that warnings will occur on nasty HTML content
}
// to check encoding: echo $dom->encoding
$this->reworkDom( $domInject );
if( $bodyOnly ) {
$fragment = $dom->createDocumentFragment();
// retrieve nodes within /html/body
foreach( $domInject->documentElement->childNodes as $elementLevel1 ) {
if( $elementLevel1->nodeName == 'body' and $elementLevel1->nodeType == XML_ELEMENT_NODE ) {
foreach( $elementLevel1->childNodes as $elementInject ) {
$fragment->insertBefore( $dom->importNode($elementInject, true) );
}
}
}
} else {
$fragment = $dom->importNode($domInject->documentElement, true);
}
return $fragment;
}
protected function reworkDom( $node, $level = 0 ) {
// start with the first child node to iterate
$nodeChild = $node->firstChild;
while ( $nodeChild ) {
$nodeNextChild = $nodeChild->nextSibling;
switch ( $nodeChild->nodeType ) {
case XML_ELEMENT_NODE:
// iterate through children element nodes
$this->reworkDom( $nodeChild, $level + 1);
break;
case XML_TEXT_NODE:
case XML_CDATA_SECTION_NODE:
// do nothing with text, cdata
break;
case XML_COMMENT_NODE:
// ensure comments to remove - sign also follows the w3c guideline
$nodeChild->nodeValue = str_replace("-","_",$nodeChild->nodeValue);
break;
case XML_DOCUMENT_TYPE_NODE: // 10: needs to be removed
case XML_PI_NODE: // 7: remove PI
$node->removeChild( $nodeChild );
$nodeChild = null; // make null to test later
break;
case XML_DOCUMENT_NODE:
// should not appear as it's always the root, just to be complete
// however generate exception!
case XML_HTML_DOCUMENT_NODE:
// should not appear as it's always the root, just to be complete
// however generate exception!
default:
throw new exception("Engine: reworkDom type not declared [".$nodeChild->nodeType. "]");
}
$nodeChild = $nodeNextChild;
} ;
}
现在这也允许将更多html片段添加到我需要自己使用的XML中。一般来说,它可以像这样使用:
$c='<p>test<font>two</p>';
$dom=new DOMDocument('1.0', 'UTF-8');
$n=$dom->appendChild($dom->createElement('info')); // make a root element
if( $valueXml=tryToXml($dom,$c) ) {
$n->appendChild($valueXml);
}
echo '<pre/>'. htmlentities($dom->saveXml($n)). '</pre>';
在这个示例中,'<p>test<font>two</p>'
很好地在格式良好的XML中输出为“<info><p>test<font>two</font></p></info>
”。添加了info根标记,因为它还允许转换不是XML的“<p>one</p><p>two</p>
”,因为它没有一个根元素。但是,如果html确实有一个根元素,则可以跳过额外的根<info>
标记。
有了这个,我从非结构化甚至已损坏的HTML中获得了真正优秀的XML!
我希望它有点清晰,并可能有助于其他人使用它。
答案 1 :(得分:1)
请记住,HTML和XML是标记语言树中的两个不同概念。你不能完全replace HTML with XML。 XML可以被视为HTML的一般形式,但即便如此也是不精确的。您主要使用HTML来显示数据,使用XML来传输(或存储)数据。
此链接很有用:How to read HTML as XML?