Regular expression for parsing HTML

Time: 2011-08-14 16:41:57

Tags: php regex html-parsing

I want to parse an HTML document and get the nicknames of all users.

They are in the following format:

<a href="/nickname_u_2412477356587950963">Nickname</a>

How can I do this with a regular expression in PHP? I can't use DOMElement or Simple HTML parsing.

2 answers:

Answer 0: (score: 3)

Here is a working solution that does not use regular expressions:

DomDocument::loadHTML() is forgiving enough to handle malformed HTML.

<?php
    $doc = new DomDocument;
    $doc->loadHTML('<a href="/nickname_u_2412477356587950963">Nickname</a>');

    $xpath = new DomXPath($doc);
    $nodes = $xpath->query('//a[starts-with(@href, "/nickname")]');

    foreach($nodes as $node) {
        $username = $node->textContent;
        $href = $node->getAttribute('href');
        printf("%s => %s\n", $username, $href);
    }
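
To see how this might hold up on a whole page rather than a single link, here is a minimal sketch. The sample markup (two user links, one unrelated link, unclosed `<div>` and `<p>` tags) is made up for illustration, and I'm assuming you want the parse warnings silenced, which is what `libxml_use_internal_errors(true)` does.

```php
<?php
// Hypothetical sample page: two user links, one unrelated link,
// and deliberately unclosed <div> and <p> tags.
$html = '<div><p>Users:'
      . '<a href="/nickname_u_2412477356587950963">Nickname</a>'
      . '<a href="/about">About</a>'
      . '<a href="/nickname_u_42">Other</a>';

// Collect libxml parse errors internally instead of raising PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

// Select only the anchors whose href starts with the user-link prefix.
$xpath = new DOMXPath($doc);
$nicknames = [];
foreach ($xpath->query('//a[starts-with(@href, "/nickname_u_")]') as $node) {
    // Map href => nickname; the href is a non-numeric string key.
    $nicknames[$node->getAttribute('href')] = trim($node->textContent);
}

print_r($nicknames);
```

The `/about` link is skipped because its `href` does not start with `/nickname_u_`.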

Answer 1: (score: 3)

preg_match_all(
    '{                  # match when
        nickname_u_     # there is nickname_u_
        \d+             # followed by one or more digits
        ">              # followed by a double quote and closing bracket
        (.*?)           # lazily capture anything that follows
        </a>            # up to the first </a> sequence
    }xm',
    '<a href="/nickname_u_2412477356587950963">Nickname</a>',
    $matches
);
print_r($matches);    // $matches[1] holds the captured nicknames

The usual disclaimer about using regex on HTML instead of an HTML parser applies. The above can probably be improved to match more reliably. It will work for the example you gave though.
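
One possible way to tighten the match, as a rough sketch only: anchor on the whole `<a ...>` tag, capture the numeric id as well as the link text, and return one entry per link with `PREG_SET_ORDER`. The sample input and the `$users` array layout are my own assumptions, not something from the question, and this still breaks on nested or unusually quoted markup.

```php
<?php
// Hypothetical input with two user links and one unrelated link.
$html = '<a href="/nickname_u_2412477356587950963">Nickname</a> '
      . '<a href="/about">About</a> '
      . '<a href="/nickname_u_42">Other</a>';

preg_match_all(
    '{
        <a\s[^>]*?                  # opening a tag, any attributes before href
        href="/nickname_u_(\d+)"    # group 1: the numeric user id
        [^>]*>                      # remainder of the opening tag
        (.*?)                       # group 2: link text, as short as possible
        </a>                        # closing tag
    }xis',
    $html,
    $matches,
    PREG_SET_ORDER                  // one array per matched link
);

// Keep ids as strings: they can be too large for a 32-bit integer.
$users = [];
foreach ($matches as $m) {
    $users[] = ['id' => $m[1], 'nickname' => trim($m[2])];
}

print_r($users);
```

The `s` modifier lets `.` cross newlines in case the link text is wrapped, and `i` tolerates uppercase tags.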