Find the name of the first HTML tag in a string, excluding the anchor tag, if one exists

Time: 2014-10-23 17:04:33

Tags: php regex

Below is a list of different instances of a string. I am looking for a regular expression that returns the name of the first HTML tag in the string.

Exception: if the anchor tag <a> is the first tag in the string, it should return an empty string ''.

Also, if the string does not contain any HTML tags, it should return an empty string ''.

$string = '<h6>Test content</h6>';
// Expected output = h6

$string = '<h6 class="my-class">Test content</h6>';
// Expected output = h6

$string = '<div>Test content</div>';
// Expected output = div

$string = '<div id="my-id" class="my-class">Test content</div>';
// Expected output = div

$string = '<div><a href="test.html">Test content</a></div>';
// Expected output = div

$string = '<div class="my-class"><a href="test.html">Test content</a></div>';
// Expected output = div

$string = '<a href="test.html">Test content</a>';
// Expected output = empty string
// It should return empty string if the first html tag is <a>

$string = "Test content";
// Expected output = empty string
// It should return an empty string if the string contains no HTML tags.

Please help!

1 answer:

Answer 0 (score: 1)

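Set the default $element to an empty string; it is what gets returned both when the string passed in contains no HTML and when the first element is an a. First check whether the string contains any HTML tags by comparing it with the value of strip_tags($string). If the two are identical, there are no HTML tags, so skip to the bottom and return $element, which is still an empty string.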
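The strip_tags() comparison can be tried on its own. This is only a minimal sketch; the helper name containsHtmlTags is made up for illustration and is not part of the answer's function.

```php
<?php
// Hypothetical helper (name invented for this sketch): reports whether
// strip_tags() removed anything from the string.
function containsHtmlTags(string $string): bool
{
    // strip_tags() returns the input unchanged when there is nothing to strip,
    // so any difference means at least one tag was removed.
    return $string !== strip_tags($string);
}

var_dump(containsHtmlTags('<h6>Test content</h6>')); // bool(true)
var_dump(containsHtmlTags('Test content'));          // bool(false)
```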

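If there are HTML tags, load the string into a DOMDocument and use XPath to get the first node name. If it is an a, break out of the loop and leave $element empty. If it is not an a, set $element to the node name and break out of the loop.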

The XPath is /html/body/* because when you use loadHTML() with invalid or partial HTML, it adds the <html><body> tags. It also wraps text that is not inside any HTML tag in a <p> tag.
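To see which nodes the /html/body/* query visits, the following sketch lists the names of the elements directly under <body>. The helper name topLevelElementNames is hypothetical, and the libxml_use_internal_errors() calls are an addition assumed here only to keep parser warnings quiet; the answer's function does not use them.

```php
<?php
// Hypothetical helper (not from the answer): returns the names of the
// elements that sit directly under <body> after loadHTML() parses the input.
function topLevelElementNames(string $html): array
{
    // Assumption: collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('/html/body/*') as $node) {
        $names[] = $node->nodeName;
    }
    return $names;
}

// Only the outer <div> is a direct child of <body>; the nested <a> is not listed.
print_r(topLevelElementNames('<div><a href="test.html">Test content</a></div>'));
```

For input that starts with bare text, the list may begin with a p element added by the parser, which is the case the function below skips.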

function getFirstElement($string) {
    $element = '';
    // check for any HTML tags
    if($string !== strip_tags($string)) {
        $doc = new DOMDocument();
        $doc->loadHTML($string);
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('/html/body/*') as $node) {
            // anything other than an <a> tag is a candidate for the result
            if((string)$node->nodeName != 'a') {
                // if the string starts with text rather than a tag, loadHTML wraps that text in a <p>, so skip it
                if(substr($string, 0, 1) != '<' && (string)$node->nodeName == 'p') continue;
                $element = (string)$node->nodeName;
                break;
            } else {
                break;
            }
        }
    }
    return $element;
}

$string = '<h6>Test content</h6>';
getFirstElement($string); // returns 'h6'

$string = '<h6 class="my-class">Test content</h6>';
getFirstElement($string); // returns 'h6'

$string = '<div>Test content</div>';
getFirstElement($string); // returns 'div'

$string = '<div id="my-id" class="my-class">Test content</div>';
getFirstElement($string); // returns 'div'

$string = '<div><a href="test.html">Test content</a></div>';
getFirstElement($string); // returns 'div'

$string = '<div class="my-class"><a href="test.html">Test content</a></div>';
getFirstElement($string); // returns 'div'

$string = '<a href="test.html">Test content</a>';
getFirstElement($string); // returns ''

$string = "Test content";
getFirstElement($string); // returns ''

$string = "Test <div>content</div>";
getFirstElement($string); // returns 'div'

$string = "<p>Test content</p>";
getFirstElement($string); // returns 'p'

To see what happens when you use DOMDocument::loadHTML(), here is the output of DOMDocument::saveHTML() for each string.

$string = '<h6>Test content</h6>';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><h6>Test content</h6></body></html>

$string = '<h6 class="my-class">Test content</h6>';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><h6 class="my-class">Test content</h6></body></html>

$string = '<div>Test content</div>';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>Test content</div></body></html>

$string = '<div id="my-id" class="my-class">Test content</div>';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div id="my-id" class="my-class">Test content</div></body></html>

$string = '<div><a href="test.html">Test content</a></div>';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div><a href="test.html">Test content</a></div></body></html>

$string = '<div class="my-class"><a href="test.html">Test content</a></div>';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div class="my-class"><a href="test.html">Test content</a></div></body></html>

$string = '<a href="test.html">Test content</a>';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a href="test.html">Test content</a></body></html>

$string = "Test content";
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Test content</p></body></html>

$string = "Test <div>content</div>";
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Test </p><div>content</div></body></html>

$string = "<p>Test content</p>";
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Test content</p></body></html>
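
Since the question asked for a regular expression, here is a rough regex-only sketch for comparison. It is not the method used in this answer, and the function name getFirstElementRegex is made up. It assumes simple tag names made of letters and digits, so a hyphenated custom element name would be cut off at the hyphen, and it does no real HTML parsing.

```php
<?php
// Hypothetical regex-only alternative (not the DOMDocument approach above).
function getFirstElementRegex(string $string): string
{
    // Find the first opening tag: '<', a name starting with a letter,
    // then anything up to the next '>'. Closing tags and comments do not match.
    if (preg_match('/<([a-z][a-z0-9]*)\b[^>]*>/i', $string, $matches)) {
        $name = strtolower($matches[1]);
        // An anchor as the first tag yields an empty string, per the question.
        return $name === 'a' ? '' : $name;
    }
    // No tag found at all.
    return '';
}

var_dump(getFirstElementRegex('<h6 class="my-class">Test content</h6>')); // string(2) "h6"
var_dump(getFirstElementRegex('<a href="test.html">Test content</a>'));   // string(0) ""
var_dump(getFirstElementRegex('Test content'));                           // string(0) ""
```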