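How do I append to all URLs in a string?

Time: 2010-06-07 14:38:13

Tags: php

How should I append a query string to the end of every URL in an HTML string that is about to be sent out? I want to add Google Analytics campaign tracking like this: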
?utm_source=email&utm_medium=email&utm_campaign=product_notify

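99% of the pages will not end in ".html", and some URLs may already contain something like ?sr=1.

6 answers:

Answer 0 (score: 6)

An update to @ircmaxell's answer: the regex now matches even when other attributes come before href, and the code is simplified.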

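/**
 * Appends Google Analytics campaign parameters to every <a href="..."> in $body.
 *
 * @param string $body     HTML to rewrite
 * @param string $campaign value for utm_campaign
 * @param string $medium   value used for both utm_source and utm_medium
 * @return string|null     rewritten HTML, or null if the regex fails
 */
protected function add_analytics_tracking_to_urls($body, $campaign, $medium = 'email') {
    // [^>]*? keeps the match inside a single <a ...> tag even when other attributes precede href
    return preg_replace_callback('#(<a\b[^>]*?href=")([^"]*)("[^>]*?>)#i', function($match) use ($campaign, $medium) {
        $url = $match[2];
        // Start a query string, or extend the existing one
        $url .= (strpos($url, '?') === false) ? '?' : '&';
        $url .= 'utm_source=' . urlencode($medium) . '&utm_medium=' . urlencode($medium) . '&utm_campaign=' . urlencode($campaign);
        return $match[1] . $url . $match[3];
    }, $body);
}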

Answer 1 (score: 5)

Well... you could do something like this:

function AppendCampaignToString($string) {
    // Capture the start of the tag, the URL, and the rest of the tag
    $regex = '#(<a href=")([^"]*)("[^>]*?>)#i';
    return preg_replace_callback($regex, '_AppendCampaignToString', $string);
}
function _AppendCampaignToString($match) {
    $url = $match[2];
    // Use '?' to start a query string and '&' to extend an existing one
    $url .= (strpos($url, '?') === false) ? '?' : '&';
    $url .= 'utm_source=email&utm_medium=email&utm_campaign=product_notify';
    return $match[1].$url.$match[3];
}

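This should automatically find all links on the page (even external ones, so be careful). The ? check just makes sure we append the parameters to the URL as a valid query string...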

Edit: fixed the regex, which was not working properly.
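The separator check described above can be isolated into a small helper. This is only a sketch: the function name append_query_to_url and the fragment handling are assumptions that are not part of the answer. It picks '?' or '&' and keeps any #fragment at the very end of the URL.

```php
<?php
// Hypothetical helper (not from the answer): append an already-encoded
// query string to a single URL.
function append_query_to_url($url, $extra)
{
    // Split off the #fragment so the parameters end up in front of it
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    if (strpos($url, '?') === false) {
        $separator = '?';   // no query string yet
    } elseif (substr($url, -1) === '?' || substr($url, -1) === '&') {
        $separator = '';    // URL already ends with a separator
    } else {
        $separator = '&';   // extend the existing query string
    }

    return $url . $separator . $extra . $fragment;
}

$utm = 'utm_source=email&utm_medium=email&utm_campaign=product_notify';
echo append_query_to_url('http://example.com/page?sr=1', $utm), "\n";
echo append_query_to_url('http://example.com/page#top', $utm), "\n";
```

Such a helper could be called from the regex callback in place of the inline strpos() check.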

Answer 2 (score: 2)

<?php
$add = array(
    'utm_source'   => 'email',
    'utm_medium'   => 'email',
    'utm_campaign' => 'product_notify',
);
$doc = new DOMDocument();
$doc->loadHTML('your html');
foreach ($doc->getElementsByTagName('a') as $link) {
    $url = parse_url($link->getAttribute('href'));
    // Merge the existing query parameters (if any) with the tracking parameters
    $gets = array();
    if (isset($url['query'])) parse_str($url['query'], $gets);
    $gets = array_merge($gets, $add);
    $newstring = '';
    if (isset($url['scheme']))   $newstring .= $url['scheme'].'://';
    if (isset($url['host']))     $newstring .= $url['host'];
    if (isset($url['port']))     $newstring .= ':'.$url['port'];
    if (isset($url['path']))     $newstring .= $url['path'];
    $newstring .= '?'.http_build_query($gets);
    if (isset($url['fragment'])) $newstring .= '#'.$url['fragment'];
    $link->setAttribute('href', $newstring);
}
$html = $doc->saveHTML();
?>

Answer 3 (score: 1)

Here is my solution: a fairly complex solution to a simple problem, but it handles all URL types.

$campaign = (object)['utm_source' => 'email', 'utm_medium' => 'email', 'utm_campaign' => 'abc'];
$host = 'www.me.com';

$html = preg_replace_callback(
        '#(<a.*?href=["\']?)(?<href>https?://[^\s"\']+)(["\']?.*?>.*?</a>)#si', function ($matches) use ($campaign, $host) {
    $url = parse_url($matches['href']);
    // if (isset($url['host']) && $url['host'] !== $host) return $matches[0];
    parse_str(isset($url['query']) ? $url['query'] : '', $query);
    $query = array_merge(
        $query, array_filter(
                  [
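                      'utm_source' => $campaign->utm_source,
                      'utm_medium' => $campaign->utm_medium,
                      // utm_term and utm_content are optional; array_filter() drops them when unset
                      'utm_term' => isset($campaign->utm_term) ? $campaign->utm_term : null,
                      'utm_content' => isset($campaign->utm_content) ? $campaign->utm_content : null,
                      'utm_campaign' => $campaign->utm_campaign,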
                  ]
              )
    );
    return $matches[1] . // anchor part before url
    (isset($url['scheme']) ? $url['scheme'] . '://' : '') .
    (isset($url['user']) ? $url['user'] : '') .
    (isset($url['pass']) ? (isset($url['user']) ? ':' : '') . $url['pass'] : '') .
    (isset($url['user']) || isset($url['pass']) ? '@' : '').
    (isset($url['host']) ? $url['host'] : '') .
    (isset($url['port']) ? ':' . $url['port'] : '') .
    (isset($url['path']) ? $url['path'] : '') .
    '?' . http_build_query($query) .
    (isset($url['fragment']) ? '#' . $url['fragment'] : '') .
    $matches[3]; // anchor part after URL
}, $html
);

The last part (concatenating the URL) could also be replaced with http_build_url(), but that requires the HTTP extension to be enabled.
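If the HTTP extension is not available, the concatenation can be moved into a reusable function instead. The following is a minimal sketch: build_url_from_parts is a hypothetical name, and it is a plain userland helper, not a drop-in replacement for http_build_url().

```php
<?php
// Hypothetical helper (name and behaviour are assumptions, not part of the
// answer): rebuilds a URL string from the array returned by parse_url().
function build_url_from_parts(array $parts)
{
    $url = '';
    if (isset($parts['scheme'])) {
        $url .= $parts['scheme'] . '://';
    }
    if (isset($parts['user'])) {
        $url .= $parts['user'];
        if (isset($parts['pass'])) {
            $url .= ':' . $parts['pass'];
        }
        $url .= '@';
    }
    if (isset($parts['host']))     $url .= $parts['host'];
    if (isset($parts['port']))     $url .= ':' . $parts['port'];
    if (isset($parts['path']))     $url .= $parts['path'];
    if (isset($parts['query']) && $parts['query'] !== '') {
        $url .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) $url .= '#' . $parts['fragment'];
    return $url;
}

// Usage: add one tracking parameter to a single URL.
$parts = parse_url('http://user:password@www.me.com/?foo=bar#section-3');
parse_str(isset($parts['query']) ? $parts['query'] : '', $query);
$query['utm_campaign'] = 'abc';
$parts['query'] = http_build_query($query);
echo build_url_from_parts($parts), "\n";
```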

The code was tested on the following URLs:

<a href="http://www.me.com">Lorem</a>
<a href="http://www.me.com/">ipsum</a>
<a href="http://www.me.com/#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php">sit</a>
<a href="http://www.me.com/?">amet</a>
<a href="http://www.me.com/?foo=bar">consectetur</a>
<a href="http://www.me.com/?foo=bar&bar=foo">consectetur</a>
<a href="http://www.NOTME.com?utm_source=XXX&utm_medium=XXX&utm_campaign=XXX">existing utm params</a>
<a href="http://user:password@www.me.com/?foo=bar#section-3">elit</a>
<a href="http://user:@www.me.com/?foo=bar#section-3">elit</a>
<a href="http://user@www.me.com?foo=bar#section-3">elit</a>

With the following results:

<a href="http://www.me.com?utm_source=email&utm_medium=email&utm_campaign=abc">Lorem</a>
<a href="http://www.me.com/?utm_source=email&utm_medium=email&utm_campaign=abc">ipsum</a>
<a href="http://www.me.com/?utm_source=email&utm_medium=email&utm_campaign=abc#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php?utm_source=email&utm_medium=email&utm_campaign=abc">sit</a>
<a href="http://www.me.com/?utm_source=email&utm_medium=email&utm_campaign=abc">amet</a>
<a href="http://www.me.com/?foo=bar&utm_source=email&utm_medium=email&utm_campaign=abc">consectetur</a>
<a href="http://www.me.com/?foo=bar&bar=foo&utm_source=email&utm_medium=email&utm_campaign=abc">consectetur</a>
<a href="http://www.NOTME.com?utm_source=email&utm_medium=email&utm_campaign=abc">existing utm params</a>
<a href="http://user:password@www.me.com/?foo=bar&utm_source=email&utm_medium=email&utm_campaign=abc#section-3">elit</a>
<a href="http://user:@www.me.com/?foo=bar&utm_source=email&utm_medium=email&utm_campaign=abc#section-3">elit</a>
<a href="http://user@www.me.com?foo=bar&utm_source=email&utm_medium=email&utm_campaign=abc#section-3">elit</a>

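As you may have noticed, my code applies to all links in the HTML (not just those on me.com). If you want to filter by hostname, uncomment the line right after parse_url().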
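The hostname filter can also be written as a separate check. This sketch assumes a hypothetical helper named is_own_host; unlike the commented-out line, it compares host names case-insensitively, which is a design choice of this example and not something the answer states.

```php
<?php
// Hypothetical helper (not from the answer): true when $href points at $host.
function is_own_host($href, $host)
{
    // parse_url() returns null when the URL has no host and false when it is malformed
    $linkHost = parse_url($href, PHP_URL_HOST);
    return is_string($linkHost) && strcasecmp($linkHost, $host) === 0;
}

$host = 'www.me.com';
$links = array(
    'http://www.me.com/?foo=bar',
    'http://www.NOTME.com?utm_source=XXX',
);
foreach ($links as $href) {
    echo $href, ' => ', is_own_host($href, $host) ? 'tag it' : 'leave as is', "\n";
}
```

Inside the callback, a link that fails this check would simply be returned unchanged as $matches[0].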

Answer 4 (score: 0)

You can use the following snippet to append the Google Analytics GET parameters to the existing parameters of the current script's URI.

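function getQuery() {
    // Query string of the current request, e.g. "sr=1"; null when there is none
    $query = parse_url($_SERVER['REQUEST_URI'], PHP_URL_QUERY);
    $utm = 'utm_source=email&utm_medium=email&utm_campaign=product_notify';

    // Only prepend the existing parameters (and the '&') when there are some
    return (!is_string($query) || $query === '') ? $utm : $query.'&'.$utm;
}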

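Answer 5 (score: 0)

My solution, which I built and tested last night:

I only match links that do not already have "utm_"-style query parameters, but I do include links that have "utm_" as part of the path (before the query parameters) or as a substring of another parameter name, e.g. "xutm_".

To do this I used a combination of positive and negative regex lookahead assertions (http://php.net/manual/en/regexp.reference.assertions.php).

I also allow the tag to have other attributes before and after the href.
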
$pattern = '/<a[^>]*href="(?=(.(?!(\?|&)utm_))*?>)[^"]*/i';

This matches all links that have neither '?utm_' nor '&utm_' in the href attribute.
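A quick way to see what the pattern accepts is to run it against a few sample tags. The sample links below are made up for illustration; only the pattern itself comes from the answer.

```php
<?php
// Pattern copied from the answer above
$pattern = '/<a[^>]*href="(?=(.(?!(\?|&)utm_))*?>)[^"]*/i';

// Illustrative sample links (assumptions, not from the answer)
$samples = array(
    '<a href="http://www.me.com/?foo=bar">plain query</a>',
    '<a class="c" href="http://www.me.com/xutm_page" id="i">utm_ in the path</a>',
    '<a href="http://www.me.com/?utm_source=x">already tagged</a>',
    '<a href="http://www.me.com/?a=1&utm_medium=y">already tagged</a>',
);

$matched = array();
foreach ($samples as $sample) {
    // preg_match() returns 1 when the link would be picked up for tagging
    $matched[] = preg_match($pattern, $sample) === 1;
}

foreach ($samples as $i => $sample) {
    echo ($matched[$i] ? 'match:    ' : 'no match: '), $sample, "\n";
}
```

The first two samples match and the last two do not, because the negative lookahead rejects any position that is followed by ?utm_ or &utm_ before the closing > of the tag.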

Then I use a class-based callback so that I can pass in the query parameters to append (as extra data for the callback):

class link_params{
  private $parameters;    

  function __construct($params){
    $this->parameters = $params;
  }

  function callback($matches){
    return $matches[0] . (preg_match('/\\?[^"]/', $matches[0]) ? '&' : '?') . http_build_query($this->parameters);
  }
}

Prepare the query parameters I want to add to the links:

$params_to_add = array(
    'utm_source' => 'newsletter-sep13',
    'utm_medium' => 'email',
    'utm_campaign' => 'product-X'
);

$callback_helper = new link_params($params_to_add);

Finally, I apply the preg_replace_callback function as follows:

$html = preg_replace_callback($pattern, array($callback_helper, 'callback'), $html);
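On PHP 5.3 or later, a closure can carry the parameters instead of a helper class. The following is a sketch of that alternative, not the answer's own code: the wrapper name append_params_to_links is hypothetical, while the pattern and the '?' / '&' check are copied from the answer.

```php
<?php
// Hypothetical wrapper (not from the answer): same pattern and separator
// logic, with the parameters passed to the callback through `use`.
function append_params_to_links($html, array $params)
{
    $pattern = '/<a[^>]*href="(?=(.(?!(\?|&)utm_))*?>)[^"]*/i';
    $query = http_build_query($params);

    return preg_replace_callback($pattern, function ($matches) use ($query) {
        // '&' if the matched href already contains a non-empty query string, else '?'
        $separator = preg_match('/\\?[^"]/', $matches[0]) ? '&' : '?';
        return $matches[0] . $separator . $query;
    }, $html);
}

$html = '<a href="http://www.me.com/page">one</a> <a href="http://www.me.com/?foo=bar">two</a>';
echo append_params_to_links($html, array('utm_source' => 'newsletter-sep13')), "\n";
```

As with the class version, the parameters are appended at the end of the href value, so a URL containing a #fragment would need extra handling.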