Question

I am using simple_html_dom.php to scrape data from an HTML website and write it out in XML format. Below is a sample of the HTML source that the script scrapes.

<h3>Background</h3>
<ol>
   <li><strong>Text here</strong>The text here text text text</li>
   <li>The text here text text <br/> text</li>
</ol>
<p>Text here</p>
<h3>Job Description</h3>

The following lines scrape only the content (the text) and ignore HTML elements such as <ol>, <li>, <br/>:

$html = file_get_html($url) ;
$xmlPageDom = new DomDocument();
@$xmlPageDom->loadHTML($html);
$xmlPageXPath = new DOMXPath($xmlPageDom);

$value1 = $xmlPageXPath->query('//text()[preceding::h3[text()="Background"] and following-sibling::h3[text()="Job Description"]]');
$value2 = $xmlPageXPath->query('//node()[preceding::h3[text()="Background"] and following-sibling::h3[text()="Job Description"]]/node()');
$tag = "background";        
$XML .= createXMLtags($tag,nodelists2string($value1, $value2));

function nodelist2string($nodelist){
        $result="";
        foreach($nodelist as $node){
            $result.="<".$node->nodeName.">";
            if ($node->hasChildNodes()){
                $result.=nodelist2string($node);
            }
            $result.=$node->nodeValue;
            $result.="</".$node->nodeName.">";
        }
        return $result;
}

function nodelists2string($nodelist1, $nodelist2){
    $result="";
    foreach($nodelist1 as $node){
        $result.="<".$node->nodeName.">";
        if ($node->hasChildNodes()){
            $result.=nodelist2string($node);
        }
        $result.=$node->nodeValue;
        $result.="</".$node->nodeName.">";
    }
    foreach($nodelist2 as $node){
        $result.="<".$node->nodeName.">";
        if ($node->hasChildNodes()){
            $result.=nodelist2string($node);
        }
        $result.=$node->nodeValue;
        $result.="</".$node->nodeName.">";
    }
    return $result;
}

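How can I scrape the text together with its inner HTML? At the moment the script scrapes only plain text. I also tried strip_tags, but it only worked for <li> and not for the other HTML elements.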

$value=strip_tags($value,'<li>');

I tried saveHTML, but could not figure out where exactly to add it.

Answer 1

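After investigating, I found that the HTML source itself is being fetched correctly. I ran echo $html; and saw that all the inner HTML content was present, but the following code ignores the HTML elements and grabs only the plain text.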

$value1 = $xmlPageXPath->query('//text()[preceding::h3[text()="Background"] and following-sibling::h3[text()="Job Description"]]');
$value2 = $xmlPageXPath->query('//node()[preceding::h3[text()="Background"] and following-sibling::h3[text()="Job Description"]]/node()');
$tag = "background";        
$XML .= createXMLtags($tag,nodelists2string($value1, $value2));

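I used preg_replace to find the HTML tags and replace them with HTML-encoded entities. After the content is imported into my database, the entities are converted back to their decoded form and the text is displayed with its formatting.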

// Match <br>, <br/> and <br />; the sample markup uses <br/> without a space.
$html = preg_replace('/<br\s*\/?>/i', '&lt;br&gt;', $html);

I used a line like the one above for each of the HTML elements mentioned earlier.
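
Writing one preg_replace call per element gets repetitive, so the same idea could be folded into a single pass. The following is only a sketch: encodeTags is a hypothetical helper name, not part of any library, and the tag list is an assumption based on the elements in the sample markup. Unlike the line above, it keeps each tag exactly as written (for example <br/> stays &lt;br/&gt;) rather than normalizing it.

```php
<?php
// Hypothetical helper (not from the original answer): encode a whitelist of
// tags in one pass instead of writing one preg_replace call per element.
function encodeTags(string $html, array $tags): string
{
    // Build an alternation such as "ol|li|br|strong|p".
    $names = implode('|', array_map('preg_quote', $tags));
    // Match opening, closing and self-closing forms: <li>, </li>, <br/>, <br />.
    $pattern = '#</?(?:' . $names . ')\b[^>]*>#i';

    return preg_replace_callback(
        $pattern,
        // Turn the matched tag into entities so it survives as literal text.
        static fn(array $m): string => htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8'),
        $html
    );
}

$encoded = encodeTags('<li>The text here <br/> text</li>', ['li', 'br']);
echo $encoded, "\n";
```

Tags that are not in the list are left untouched, so the whitelist has to name every element that should be preserved.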

Answer 2

As far as I know, this can't be done with simple html dom, but if you switch to this one, you can do:

$str = <<<EOF
<h3>Background</h3>
<ol>
   <li><strong>Text here</strong>The text here text text text</li>
   <li>The text here text text <br/> text</li>
</ol>
<p>Text here</p>
<h3>Job Description</h3>
EOF;

$html = str_get_html($str);

echo $html->text;
/* will output:
Background
Text hereThe text here text text text
   The text here text text  text
Text here
Job Description
*/
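
If switching libraries is not an option, one possible alternative is PHP's built-in DOM extension: DOMDocument::saveHTML() accepts a node argument and returns that node's markup rather than just its text. The following is a minimal sketch, assuming the wanted content is made up of the sibling nodes between the two h3 headings. The XPath expression and the variable names are illustrative and not taken from the code above.

```php
<?php
// Sample markup copied from the question.
$source = <<<EOF
<h3>Background</h3>
<ol>
   <li><strong>Text here</strong>The text here text text text</li>
   <li>The text here text text <br/> text</li>
</ol>
<p>Text here</p>
<h3>Job Description</h3>
EOF;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of using @
$dom->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Every sibling after the "Background" heading that still has the
// "Job Description" heading somewhere after it.
$nodes = $xpath->query(
    '//h3[text()="Background"]/following-sibling::node()'
    . '[following-sibling::h3[text()="Job Description"]]'
);

$innerHtml = '';
foreach ($nodes as $node) {
    // saveHTML($node) serializes the node including its tags and children.
    $innerHtml .= $dom->saveHTML($node);
}
$innerHtml = trim($innerHtml);
echo $innerHtml, "\n";
```

The resulting string still contains the <ol>, <li>, <strong> and <p> tags, so it would need to be escaped (or wrapped in a CDATA section) before being written into the XML output.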

Scraping content together with its inner HTML using simple_html_dom
