HTML to JSON - extract only the href attribute values

Time: 2015-01-30 02:57:45

Tags: php

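I can currently convert HTML to JSON. Using the function element_to_obj, I parse the HTML and get a JSON object that contains the HTML content. My main question: is it possible to return only the values of the href attributes in the JSON object and ignore everything else?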

function html_to_obj($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->documentElement);
}

function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
    <head>
        <title> This is a test </title>
    </head>
    <body>
        <h1> Go to a site? </h1>
        <ul>
            <li> <a href="http://example.com">Some Site</a> </li>
            <li> <a href="http://example.com">Some Site</a> </li>
        </ul>
        <h1> Other sites to visit: </h1>
        <div><a href="http://example.com">Some Site</a></div>
        <div><a href="http://example.com">Some Site</a></div>
        <div><a href="http://example.com">Some Site</a></div>
        <div><a href="http://example.com">Some Site</a></div>
    </body>
</html>
EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

2 answers:

Answer 0 (score: 0)

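I think the best approach is to write a simple text parser. Search each JSON object for instances of href=" and return the string that follows, up to the next unescaped ". If I remember correctly, JavaScript has basic functions such as string.substring that would work for this. Alternatively, if you are comfortable with regular expressions, you could use a regex.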
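Since the question is tagged php, the regex idea above could be sketched in PHP rather than JavaScript. This is only a minimal sketch: the function name hrefs_by_regex is hypothetical, it swaps in preg_match_all for the substring search the answer describes, and it assumes every href value is quoted. Regular expressions are fragile on real-world HTML, so treat it as an illustration and not as a robust parser.

```php
<?php
// Hypothetical helper (not from the answer): pull href values out of raw
// markup with a regular expression. Assumes attribute values are quoted.
function hrefs_by_regex($text) {
    // Match href="..." or href='...' and capture the text between the quotes.
    // \1 refers back to whichever quote character opened the value.
    $pattern = '/\bhref\s*=\s*(["\'])(.*?)\1/i';
    preg_match_all($pattern, $text, $matches);

    // Decode entities such as &amp; that may appear inside attribute values.
    return array_map('html_entity_decode', $matches[2]);
}

$sample = '<a href="http://example.com/?a=1&amp;b=2">x</a> <a HREF=\'/local\'>y</a>';
echo json_encode(hrefs_by_regex($sample), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
```

The backreference keeps a double-quoted value from ending at a single quote inside it, and the reverse. Unquoted values such as href=/local are not matched by this pattern.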

Answer 1 (score: 0)

You can use getElementsByTagName and then iterate over all the matching elements.

<?php

function html_to_obj($html, $tag = 'a') {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->getElementsByTagName($tag));
}

function element_to_obj($elements) {
    $obj = array();
    foreach($elements as $index => $element){

        $obj[$index] = array( "tag" => $element->tagName );
        foreach ($element->attributes as $attribute) {
            $obj[$index][$attribute->name] = $attribute->value;
        }
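        foreach ($element->childNodes as $subElement) {
            if ($subElement->nodeType == XML_TEXT_NODE) {
                $obj[$index]["html"] = $subElement->wholeText;
            }
            elseif ($subElement->nodeType == XML_ELEMENT_NODE) {
                // element_to_obj() expects a list of elements, so wrap the child and unwrap the result.
                $children = element_to_obj(array($subElement));
                $obj[$index]["children"][] = $children[0];
            }
        }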
    }

    return $obj;
}

$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
    <head>
        <title> This is a test </title>
    </head>
    <body>
        <h1> Go to a site? </h1>
        <ul>
            <li> <a href="http://example.com">Some Site</a> </li>
            <li> <a href="http://example.com">Some Site</a> </li>
        </ul>
        <h1> Other sites to visit: </h1>
        <div><a href="http://example.com">Some Site</a></div>
        <div><a href="http://example.com">Some Site</a></div>
        <div><a href="http://example.com">Some Site</a></div>
        <div><a href="http://example.com">Some Site</a></div>
    </body>
</html>
EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);
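
The code above still returns the tag name and link text alongside each href. If only the href values are wanted, as the question asks, a shorter variant might look like the sketch below. The function name extract_hrefs and the sample markup are hypothetical and not part of the original answer. The sketch relies on DOMElement::hasAttribute and DOMElement::getAttribute, and it assumes that silencing libxml warnings is acceptable for imperfect input.

```php
<?php
// Hypothetical helper (not from the answer): collect only the href values
// of the matching elements and ignore everything else.
function extract_hrefs($html, $tag = 'a') {
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them, then
    // restore the previous setting once parsing is done.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($dom->getElementsByTagName($tag) as $element) {
        // Skip elements that have no href attribute at all.
        if ($element->hasAttribute('href')) {
            $hrefs[] = $element->getAttribute('href');
        }
    }
    return $hrefs;
}

$html = '<ul><li><a href="http://example.com">Some Site</a></li>'
      . '<li><a name="top">No link</a></li>'
      . '<li><a href="http://example.com/about">About</a></li></ul>';

echo json_encode(extract_hrefs($html), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
```

Returning a flat list keeps the JSON output to a plain array of URLs. JSON_UNESCAPED_SLASHES is optional and only stops json_encode from writing each / as \/.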