Question

I have an HTML file like this (sample.html):

<html>
<head>
</head>
<body>
<div id="content">
<!--content-->

<p>some content</p>

<!--content-->
</div>
</body>
</html>

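How can I use PHP to get the content between the two HTML comments `<!--content-->`? I want to get it, do some processing, and put it back, so I need to both get and put. Is that possible?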

Answer 1

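esafwan - you could use a regular expression to extract the content of the div (with a certain id).

I've done this for image tags before, so the same rules apply. I'll look up the code and update this post in a bit.

[Update] Try this: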

<?php
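    // Return the inner HTML of the first <div> whose attribute $attr equals $value,
    // or null when there is no match. Nested <div>s are not handled.
    function get_tag( $attr, $value, $xml ) {

        // Escape regex metacharacters, including the "/" delimiter
        $attr = preg_quote($attr, '/');
        $value = preg_quote($value, '/');

        $tag_regex = '/<div[^>]*'.$attr.'="'.$value.'">(.*?)<\\/div>/si';

        // preg_match returns 1 only when the pattern matched
        if (preg_match($tag_regex, $xml, $matches)) {
            return $matches[1];
        }
        return null;
    }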

    $yourentirehtml = file_get_contents("test.html");
    $extract = get_tag('id', 'content', $yourentirehtml);
    echo $extract;
?>

Or, more simply:

preg_match("/<div[^>]*id=\"content\">(.*?)<\\/div>/si", $text, $match);
$content = $match[1];

Jim
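
The snippets above only read the div. Since the question also asks to put the processed content back between the two `<!--content-->` comments, here is a minimal sketch of that round trip using `preg_replace_callback`. The sample markup is inlined so it runs on its own, and `process_content` is a hypothetical placeholder for whatever processing is needed; neither comes from the answer above.

```php
<?php
// Sample document inlined so the sketch is self-contained.
$html = <<<'HTML'
<div id="content">
<!--content-->

<p>some content</p>

<!--content-->
</div>
HTML;

// Hypothetical processing step: upper-case everything between the markers.
function process_content(string $inner): string {
    return strtoupper($inner);
}

// Capture both markers and the text between them, then rebuild the string
// with the processed text in the middle. The "s" flag lets "." match newlines.
$pattern = '/(<!--content-->)(.*?)(<!--content-->)/s';
$result = preg_replace_callback($pattern, function (array $m): string {
    return $m[1] . process_content($m[2]) . $m[3];
}, $html);

echo $result;
```

Keeping the markers in the output means the same script can be run again later on the processed file.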

Answer 2

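If this is a simple replacement that does not involve parsing an actual HTML document, you can use a regex or even just str_replace. In general, though, it is not advisable to use regex for HTML, because HTML is not regular and coming up with reliable patterns can quickly become a nightmare.

The right way to parse HTML in PHP is to use a parsing library that actually knows how to make sense of HTML documents. Your best native bet is DOM, but PHP has a number of other native XML extensions you can use, and there are also a number of third-party libraries such as phpQuery, Zend_Dom and QueryPath.

If you use FluentDom, you should be able to find examples that show how to solve your problem.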
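
As a rough illustration of the DOM route, the sketch below loads the sample markup with `DOMDocument`, locates the two comment nodes with XPath, and collects whatever sits between them. It assumes both comments are siblings under the same parent, as in the question's sample; the variable names are made up for the example.

```php
<?php
$html = <<<'HTML'
<html>
<head>
</head>
<body>
<div id="content">
<!--content-->

<p>some content</p>

<!--content-->
</div>
</body>
</html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select comment nodes whose text is exactly "content".
$markers = $xpath->query('//comment()[. = "content"]');

$between = '';
if ($markers->length >= 2) {
    $end = $markers->item(1);
    // Walk the siblings from the first marker up to (not including) the second.
    $node = $markers->item(0)->nextSibling;
    while ($node !== null && !$node->isSameNode($end)) {
        $between .= $doc->saveHTML($node);
        $node = $node->nextSibling;
    }
}

echo trim($between);
```

From here the nodes between the markers can be modified or replaced through the DOM API and the whole document written back with `$doc->saveHTML()`.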

Answer 3

<?php

    $content=file_get_contents("sample.html");
    $comment=explode("<!--content-->",$content);
    $comment=explode("<!--content-->",$comment[1]);
    var_dump(strip_tags($comment[0]));
?>

Check this out, it should work for you.
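
To also put the content back, the same `explode` idea can be extended with a limit and a matching `implode`. This is only a sketch with the sample markup inlined; wrapping the content in a `<section>` is a hypothetical stand-in for the real processing.

```php
<?php
$content = "<div id=\"content\">\n<!--content-->\n\n<p>some content</p>\n\n<!--content-->\n</div>";
$marker = '<!--content-->';

// Split into at most three pieces: before, between, and after the markers.
$parts = explode($marker, $content, 3);

if (count($parts) === 3) {
    // Hypothetical processing step: wrap the inner content in a <section>.
    $parts[1] = "\n<section>" . trim($parts[1]) . "</section>\n";
    // Re-joining with the marker restores both comments around the new content.
    $content = implode($marker, $parts);
}

echo $content;
```

The `count($parts) === 3` check leaves the document untouched when the two markers are not both present.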

Answer 4

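See the code example here, which shows how to load an HTML document into SimpleXML: http://blog.charlvn.com/2009/03/html-in-php-simplexml.html

You can then treat it as a normal SimpleXML object.

Edit: This only works if you want the content inside a tag (e.g. between <div> and </div>).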
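
The linked post is not reproduced here. One common way to get HTML into SimpleXML, assumed for this sketch rather than taken from that post, is to parse it with `DOMDocument::loadHTML` and convert the result with `simplexml_import_dom`. The markup and the replacement text are made up for illustration.

```php
<?php
$html = '<html><head></head><body><div id="content"><p>some content</p></div></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Convert the DOM tree into a SimpleXMLElement rooted at <html>.
$xml = simplexml_import_dom($doc);

// Look up the div by id, read its <p>, then overwrite the <p> text.
$divs = $xml->xpath('//div[@id="content"]');
if ($divs) {
    echo (string) $divs[0]->p, "\n";
    $divs[0]->p = 'processed content';
    echo $divs[0]->asXML(), "\n";
}
```

As the edit above notes, this addresses elements, so it fits content inside a tag better than content between two comments.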

Answer 5

The problem is with nested divs. I found the solution here:

<?php // File: MatchAllDivMain.php
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
// Commented regex to extract contents from <div class="main">contents</div>
//  where "contents" may contain nested <div>s.
//  Regex uses PCRE's recursive (?1) subexpression syntax to recurse into group 1
$pattern_long = '{           # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*>              # match the "main" class DIV opening tag
  (                                   # capture "main" DIV contents into $1
    (?:                               # non-cap group for nesting * quantifier
      (?: (?!<div[^>]*>|</div>). )++  # possessively match all non-DIV tag chars
    |                                 # or 
      <div[^>]*>(?1)</div>            # recursively match nested <div>xyz</div>
    )*                                # loop however deep as necessary
  )                                   # end group 1 capture
</div>                               # match the "main" class DIV closing tag
}six';  // single-line (dot matches all), ignore case and free spacing modes ON

// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';

$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
    echo("$matchcount matches found.\n");
//  print_r($matches);
    for($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i]); // print 1st capture group for match number i
    }
} else {
    echo('No matches');
}
echo("\n</pre>");
?>

Get the content inside an HTML tag using PHP and replace it after processing

5 answers: