Extracting the source URL and anchor text from a string

Time: 2012-11-23 12:31:32

Tags: php

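I am trying to extract data from a series of strings, but without any luck. In the sample code below I tried preg_split, but it does not give me the result I want.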

Using the following code:

<?php
$str = '<a href="https://rads.stackoverflow.com/amzn/click/com/B008EYEYBA" rel="nofollow noreferrer">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />
';
$chars = preg_split('/ /', $str, -1, PREG_SPLIT_OFFSET_CAPTURE);

echo '<pre>';
print_r($chars);
echo '</pre>';
?>

gives this result:

Array
(
    [0] => Array
        (
            [0] => <a
            [1] => 0
        )

    [1] => Array
        (
            [0] => href="https://rads.stackoverflow.com/amzn/click/com/B008EYEYBA" rel="nofollow noreferrer">Nike
            [1] => 3
        )

    [2] => Array
        (
            [0] => Air
            [1] => 167
        )

    [3] => Array
        (
            [0] => Jordan
            [1] => 171
        )

    [4] => Array
        (
            [0] => SC-2
            [1] => 178
        )

    [5] => Array
        (
            [0] => Mens
            [1] => 183
        )

    [6] => Array
        (
            [0] => Basketball
            [1] => 188
        )

    [7] => Array
        (
            [0] => Shoes
            [1] => 199
        )

    [8] => Array
        (
            [0] => 454050-035</a><img
            [1] => 205
        )

    [9] => Array
        (
            [0] => src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA"
            [1] => 224
        )

    [10] => Array
        (
            [0] => width="1"
            [1] => 305
        )

    [11] => Array
        (
            [0] => height="1"
            [1] => 315
        )

    [12] => Array
        (
            [0] => border="0"
            [1] => 326
        )

    [13] => Array
        (
            [0] => alt=""
            [1] => 337
        )

    [14] => Array
        (
            [0] => style="border:none
            [1] => 344
        )

    [15] => Array
        (
            [0] => !important;
            [1] => 363
        )

    [16] => Array
        (
            [0] => margin:0px
            [1] => 375
        )

    [17] => Array
        (
            [0] => !important;"
            [1] => 386
        )

    [18] => Array
        (
            [0] => />

            [1] => 399
        )

)

Note that element [1] also includes the word Nike, when all I need from it is the URL.

[1] => Array
        (
            [0] => href="https://rads.stackoverflow.com/amzn/click/com/B008EYEYBA" rel="nofollow noreferrer">Nike
            [1] => 3
        )

In fact, my ultimate goal in parsing $str is just to output the source URL and the anchor text into separate arrays, like this:

URL:

http://www.amazon.com/gp/product/B008EYEYBA/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B008EYEYBA&linkCode=as2&tag=mytwitterpage-20

Anchor text:

Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035

Any ideas on how to achieve this would be much appreciated.

2 Answers:

Answer 0: (score: 0)

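You can do this with PHP's built-in functions.

What you want here is to remove the anchor tag.

You can use the strip_tags() function to remove all tags, which leaves only the text.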
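
As a rough sketch of what this suggestion yields (the markup below is a shortened copy of the string from the question, and the variable name $anchorText is only illustrative): strip_tags() returns the text between the tags, so it recovers the anchor text but discards the href, which covers only half of what the question asks for.

```php
<?php
// Shortened version of the question's markup (some attributes trimmed for brevity).
$str = '<a href="https://rads.stackoverflow.com/amzn/click/com/B008EYEYBA" rel="nofollow noreferrer">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20" width="1" height="1" alt="" />
';

// strip_tags() drops every tag together with its attributes, so only the text
// nodes survive; trim() removes the trailing newline left over from the input.
$anchorText = trim(strip_tags($str));

echo $anchorText, "\n";
```

The URL lives in an attribute, so it has to be read some other way, for example with the DOM extension.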

Answer 1: (score: 0)

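Using regular expressions to parse HTML is bad practice. PHP has the DOM extension for this. You simply cannot build a generic regular expression that works for every piece of HTML you might encounter, and the DOM approach is far more extensible.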

$string = '<a href="http://rads.stackoverflow.com/amzn/click/B008EYEYBA">Nike Air Jordan SC-2 Mens Basketball Shoes 454050-035</a><img src="http://www.assoc-amazon.com/e/ir?t=mytwitterpage-20&l=as2&o=1&a=B008EYEYBA" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />';
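$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($string);
libxml_clear_errors();
// Take the first <a> element; item(0) returns null when there is none.
$elementA = $dom->getElementsByTagName('a')->item(0);
$aText = $elementA->nodeValue;            // anchor text
$aLink = $elementA->getAttribute('href'); // source URL
echo $aLink . "\n" . $aText;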
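
Since the question asks for the URLs and anchor texts in separate arrays, the same DOM approach could be extended to every link in a string. The following is only a sketch under stated assumptions: the function name extractLinks, the array keys 'urls' and 'anchors', and the example.com markup are all hypothetical and do not come from the original answer.

```php
<?php
// Hypothetical helper (not part of the original answer): returns two parallel
// arrays, one with every href and one with the matching anchor text.
function extractLinks(string $html): array
{
    $urls = [];
    $anchors = [];

    $dom = new DOMDocument();
    // Collect libxml warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // DOMNodeList is iterable, so each <a> element can be visited in turn.
    foreach ($dom->getElementsByTagName('a') as $a) {
        $urls[] = $a->getAttribute('href');
        $anchors[] = trim($a->textContent);
    }

    return ['urls' => $urls, 'anchors' => $anchors];
}

// Made-up sample markup with two links.
$html = '<a href="http://example.com/one">First link</a> <a href="http://example.com/two">Second link</a>';
$result = extractLinks($html);
print_r($result);
```

Keeping the two arrays parallel means that index i in 'urls' always belongs to index i in 'anchors'; an array of URL/text pairs would work equally well if that suits the calling code better.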