Extracting multiple strong tags using PHP Simple HTML DOM Parser

Time: 2014-06-10 12:40:52

Tags: php html parsing simple-html-dom

I have more than 500 (static) pages that share the following content structure:

<section>
Some text 
<strong>Dynamic Title (Different on each page)</strong> 
<strong>Author name (Different on each page)</strong> 
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE)</b>
</section> 

I need to use PHP Simple HTML DOM Parser to extract the data in the format below:

$title = <strong>Dynamic Title (Different on each page)</strong> 
$author = <strong>Author name (Different on each page)</strong> 
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)

So far I have failed and can't get my head around it. I'd appreciate any suggestion or code snippet to help me move forward.

Edit 1: I have now solved the strong-tag part with this:

$html = file_get_html($url);
$content = array();
foreach($html->find('strong') as $a) {
 $content[] = $a->innertext;
}

$title= $content[0];                
$author= $content[1];

The only remaining problem is: how do I extract the content inside the parentheses? With a similar method?

3 Answers:

Answer 0: (score: 2)

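OK, first you want to get all of the <section> tags. Then you want to search inside each one for the <strong> and <b> tags. Like this: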

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all <section> elements
foreach ($html->find('section') as $section) {

    // Collect the text of the <strong> tags inside this <section>
    $strong = array();
    foreach ($section->find('strong') as $tag) {
        $strong[] = $tag->innertext;
    }

    $title    = $strong[0];
    $author   = $strong[1];
    $category = $strong[2];

}

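To get the part in parentheses, just take the text of the <b> tags and add the () brackets yourself. Or, if you are asking how to get the part between the parentheses, use explode and then strip the closing parenthesis: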

$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
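
As a rough sketch of the same explode idea wrapped in a function: the helper name between_parens and the sample string are made up for illustration, and it only handles the first pair of parentheses.

```php
<?php
// Hypothetical helper (not part of any library): returns the text between
// the first "(" and the next ")", or null when there is no "(" at all.
function between_parens($text)
{
    $pieces = explode("(", $text, 2);
    if (count($pieces) < 2) {
        return null; // no opening parenthesis found
    }
    // Keep everything up to the first closing parenthesis
    $inner = explode(")", $pieces[1], 2);
    return trim($inner[0]);
}

// Assumed sample text, modelled on the structure in the question
$sample = "Category (Content MORE TEXT HERE)";
echo between_parens($sample), "\n";
```

Limiting explode to two pieces keeps any later parentheses in the string from shifting the result, which plain $pieces[1] with str_replace would not guard against.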

Answer 1: (score: 0)

$html_code = 'html'; // placeholder: put the page's HTML source here
$dom = new \DOMDocument();
$dom->loadHTML($html_code);
$xpath = new \DOMXPath($dom);
$nodelist = $xpath->query("//strong");
for ($i = 0; $i < $nodelist->length; $i++) {
    echo $nodelist->item($i)->nodeValue, "\n"; // the text inside each <strong>
}
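
For what it's worth, here is a minimal sketch of how this built-in DOMDocument approach (not Simple HTML DOM) might be applied to the structure from the question. The sample markup and its values are placeholders I made up, and I am assuming the parenthesised content always sits in <b> tags inside the <section>.

```php
<?php
// Sample markup modelled on the question; the values are placeholders
$html_code = '<section>Some text
<strong>Dynamic Title</strong>
<strong>Author name</strong>
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE</b>)
</section>';

$dom = new DOMDocument();
// libxml may warn about HTML5 tags such as <section>; collect the warnings quietly
libxml_use_internal_errors(true);
$dom->loadHTML($html_code);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Text of every <strong> inside a <section>, in document order
$strong = array();
foreach ($xpath->query('//section//strong') as $node) {
    $strong[] = trim($node->nodeValue);
}
list($title, $author, $category) = $strong;

// Text of every <b> inside a <section> (the parenthesised content)
$content = array();
foreach ($xpath->query('//section//b') as $node) {
    $content[] = trim($node->nodeValue);
}

echo $title, ' | ', $author, ' | ', $category, "\n";
echo implode(' ', $content), "\n";
```

Reading the <b> tags directly avoids parsing the parentheses at all, but it depends on the pages really following that structure.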

Answer 2: (score: 0)

The final code that works for me now looks like this.

$html = file_get_html($url);
$content = array();
foreach ($html->find('strong') as $a) {
    $content[] = $a->innertext;
}

$title    = $content[0];
$author   = $content[1];
$category = $content[2];

// Reuse the already parsed document instead of fetching the URL a second time
$input = $html->plaintext;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);
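
A small variation on the regex, in case the parentheses themselves are not wanted in the result. This is only a sketch: the $input string is an assumed stand-in for the page's plain text, and the capture group is my addition.

```php
<?php
// Assumed plain text of one page after the tags have been stripped
$input = "Some text Dynamic Title Author name Category (Content MORE TEXT HERE)";

// Group 1 captures the text between "(" and ")" without the parentheses
preg_match_all('/\(([^)]*)\)/', $input, $matches);

// $matches[0] holds the full matches, $matches[1] only the captured insides
$inside = array_map('trim', $matches[1]);
print_r($inside);
```

Keep in mind that any parentheses in the title or author text would match as well, so the pattern may return more entries than just the content block.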