I am using PHP 7.4.1.
I am trying to parse an RSS feed from Google.
My links look like this:
https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm
https://www.google.com/url?rct=j&sa=t&url=https://www.politifact.com/factchecks/2020/oct/31/raphael-warnock/fact-checking-raphael-warnocks-claim-georgia-sen-k/&ct=ga&cd=CAIyGm
https://www.google.com/url?rct=j&sa=t&url=https://www.benzinga.com/news/20/10/18156683/last-weeks-notable-insider-buys-ibm-intel-raytheon-and-more&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-avino-silver-gold-mines-ltd-nyseasm-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5Y
https://www.google.com/url?rct=j&sa=t&url=https://www.businessinsider.co.za/who-received-an-sms-from-markus-jooste-2020-10&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MDY6Y29tOmVuOlVT&am
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-veritone-inc-nasdaqveri-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2M
https://www.google.com/url?rct=j&sa=t&url=https://heavy.com/sports/las-vegas-raiders/jj-watt-stephon-gilmore-trade-targets/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MDY6Y29tOmVuOlVT&a
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-truecar-inc-nasdaqtrue-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MD
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-veeco-instruments-inc-nasdaqveco-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRl
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-21vianet-group-inc-nasdaqvnet-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU
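I want to get the real link from the url= parameter and cut off the trailing parameters, such as &ct=ga&cd=CAIyGjRm.
I tried str_replace, but because the trailing part differs from link to link, it is hard to extract the result reliably.
Any suggestions on how to get the link?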
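Answer 0 (score: 2)
Regex is appropriate when there is no legitimate, native, reliable technique for parsing the text.
PHP offers native functions to parse URLs and query strings.
The following snippet involves multiple native function calls and will run slower than a regex, but it is also much less likely to break when your external data source reconfigures its query string. For example, if they add another parameter named rawurl=, a regex is liable to match the wrong one.
The debate between using a legitimate parser and using regex (on URLs, valid HTML, BBCode, etc.) is all too common, but the developer's primary goal should always be data integrity. Only sacrifice data integrity for execution speed if you are processing large volumes of data and the speed gain actually delivers a valuable benefit to your system or end users. If you find no reasonable justification for leaning toward the micro-optimized solution, I recommend that you not drink the Kool-Aid.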
Here is one way to parse the URL and extract the url value.
Code:
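$url = 'https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm';
parse_str(
    // decode entities such as &amp; that may appear in feed markup
    htmlspecialchars_decode(
        // isolate the query string (everything after the first "?")
        parse_url(
            $url,
            PHP_URL_QUERY
        )
    ),
    $parts // receives the parsed key/value pairs
);
echo $parts['url'];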
Output:
https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/
I love regex, but not for every task. Avoiding regex here will make your script more readable, more reliable, and easier to maintain.
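To apply the same idea to a whole list of feed links, the calls can be wrapped in a small helper. The following is only a minimal sketch: the function name extractTargetUrl, the example.com link, and the choice to return null when no url parameter is present are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the value of the "url" query parameter,
// or null when the link has no query string or no such parameter.
function extractTargetUrl(string $link): ?string
{
    $query = parse_url($link, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or parse_url() could not parse the link
    }

    // Feed markup may encode "&" as "&amp;", so decode before parsing.
    parse_str(htmlspecialchars_decode($query), $params);

    return isset($params['url']) && is_string($params['url'])
        ? $params['url']
        : null;
}

$links = [
    // first sample link from the question
    'https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm',
    // invented link with entity-encoded ampersands
    'https://www.google.com/url?rct=j&amp;sa=t&amp;url=https://example.com/article&amp;ct=ga',
    // invented link without a query string
    'https://www.google.com/',
];

foreach ($links as $link) {
    var_dump(extractTargetUrl($link));
}
```

Returning null instead of raising an error lets the calling code decide whether a link without a url parameter should be skipped or logged.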
Answer 1 (score: 1)
You can use this regex in preg_match_all:
(?<=url=)https?:\S+?(?=&|$)
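RegEx details:
(?<=url=): lookbehind asserting that url= appears immediately before the current position
https?:\S+?: matches a URL starting with http: or https:, taking as few non-whitespace characters as possible
(?=&|$): lookahead asserting that & or the end of the input follows the current position
Code: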
php > $s = "https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm";
php > preg_match_all('~(?<=url=)https?:\S+?(?=&|$)~', $s, $m);
php > print_r($m[0]);
Array
(
    [0] => https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/
)
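One caveat with this pattern: the lookbehind (?<=url=) is also satisfied when url= is the tail of a longer parameter name. The sketch below assumes a hypothetical link that carries a rawurl= parameter ahead of url= (both example.com targets are invented) and compares the pattern with a slightly stricter variant that requires ? or & directly before url=. The variant is a suggestion, not part of the answer above, and it assumes any &amp; entities have already been decoded.

```php
<?php
// Hypothetical link: the "rawurl" parameter and both targets are invented.
$s = 'https://www.google.com/url?rct=j&rawurl=https://example.com/raw&url=https://example.com/real&ct=ga';

// Pattern from the answer: "url=" is also the tail of "rawurl=",
// so the values of both parameters are returned.
preg_match_all('~(?<=url=)https?:\S+?(?=&|$)~', $s, $loose);
print_r($loose[0]);

// Stricter variant: the character before "url=" must be "?" or "&",
// so only the parameter named exactly "url" matches.
preg_match_all('~(?<=[?&]url=)https?:\S+?(?=&|$)~', $s, $strict);
print_r($strict[0]);
```

With this input the first call returns both example.com values, while the second returns only the value of the url parameter.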