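I have a list of URLs that can be in any format: one per line, comma-separated, with random text in between, and so on. The URLs all come from 2 different sites and have a similar structure.
For this example, let's say it looks like this: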
Random Text - http://www.domain2.com/variable-value
Random Text 2 - http://www.domain1.com/variable-value, http://www.domain1.com/variable-value, http://www.domain1.com/variable-value
http://www.domain1.com/variable-value
http://www.domain2.com/variable-value
http://www.domain1.com/variable-value http://www.domain2.com/variable-value http://www.domain1.com/variable-value
I need to extract 2 pieces of information: whether the URL belongs to domain1 or domain2, and the value that follows variable-.
So it should create a multidimensional array in which each entry has 2 items: domain + value.
What is the best way to do this?
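Answer 0: (score: 1)
Here is one possibility for extracting the URLs. The only catch is that the URLs themselves must not contain commas. So if that is good enough...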
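// Double quotes are required so that "\n" is a real newline character.
$lines = explode("\n", $urls);
foreach ($lines as $line)
{
    // The pattern needs delimiters; '!' avoids having to escape the slashes.
    // Whitespace is excluded too, so space-separated URLs are matched one by one.
    if (preg_match_all('!http://[^,\s]*variable-([^,\s]+)!', $line, $matches))
    {
        // $matches[0] holds the full URLs, $matches[1] the captured values.
    }
}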
By the way, the matches are stored in the $matches array.
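P.P.S.: After some further research I found this page: http://internet.ls-la.net/folklore/url-regexpr.html. It contains a regular expression for URLs. You could use it to extract the URLs first, and then, in a second step, go through the URLs and extract the variable information, e.g. with variable-([\w]+).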
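The two-step idea above could be sketched roughly as follows. This is only an illustration: the helper name extractValues is made up, and the URL pattern in step 1 is a deliberately simple stand-in (anything starting with http:// or https:// up to the next comma or whitespace), not the full expression from the linked page.

```php
<?php
// Hypothetical helper: step 1 finds URL-like tokens, step 2 reads the value.
function extractValues(string $text): array
{
    $result = [];
    // Step 1 (assumed, simplified URL pattern): "http://" or "https://"
    // followed by anything that is not whitespace or a comma.
    preg_match_all('!https?://[^\s,]+!', $text, $found);
    foreach ($found[0] as $url) {
        // Step 2: pull out the value that follows "variable-" in this URL.
        if (preg_match('!/variable-([\w-]+)!', $url, $m)) {
            $result[] = [$url, $m[1]];
        }
    }
    return $result;
}

$text = "Random Text - http://www.domain2.com/variable-abc\n"
      . "http://www.domain1.com/variable-def, http://www.domain1.com/variable-ghi";
print_r(extractValues($text));
```

Keeping the two steps separate means the URL pattern can be swapped for a stricter one later without touching the value extraction.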
Answer 1: (score: 0)
Use preg_split, preg_match and parse_url:
// split urls
$urls = preg_split('!,\s+!', 'http://www.domain1.com/variable-value, http://www.domain2.com/variable-value, http://www.domain3.com/variable-value');
$result = array();
// check for domain and path variable
foreach ($urls as $url) {
    $parts = parse_url($url);
    $matches = array();
    // domain: $parts['host']; value: captured from $parts['path']
    if (isset($parts['host'], $parts['path']) && preg_match('!^/variable-([^/]+)!', $parts['path'], $matches)) {
        $result[] = array($parts['host'], $matches[1]);
    }
}
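The snippet above assumes a clean ", " separator. For the mixed input in the question (random text, commas, spaces and newlines), one possible variation is to split on any run of commas or whitespace and let parse_url discard the tokens that are not URLs. The function name domainValuePairs is hypothetical, and this is a sketch rather than a hardened parser:

```php
<?php
// Hypothetical helper; returns a list of [domain, value] pairs.
function domainValuePairs(string $text): array
{
    $pairs = [];
    // Split on commas and/or whitespace so every token is a candidate URL.
    $tokens = preg_split('![,\s]+!', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        $parts = parse_url($token);
        // Plain words such as "Random" have no host and are skipped.
        if (!is_array($parts) || !isset($parts['host'], $parts['path'])) {
            continue;
        }
        // Keep only paths of the form /variable-<value>.
        if (preg_match('!^/variable-([^/]+)!', $parts['path'], $m)) {
            $pairs[] = [$parts['host'], $m[1]];
        }
    }
    return $pairs;
}

$text = "Random Text - http://www.domain2.com/variable-x1\n"
      . "http://www.domain1.com/variable-y2, http://www.domain1.com/variable-z3";
print_r(domainValuePairs($text));
```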
Answer 2: (score: 0)
$text = "http://www.domain1.com/variable-value1, http://www.domain2.com/variable-value2 http://www.domain1.com/variable-value3";
preg_match_all("/http:\\/\\/(.+?)\\/variable-([a-z0-9]+)/si", $text, $matches);
print_r($matches);
Result:
Array
(
    [0] => Array
        (
            [0] => http://www.domain1.com/variable-value1
            [1] => http://www.domain2.com/variable-value2
            [2] => http://www.domain1.com/variable-value3
        )

    [1] => Array
        (
            [0] => www.domain1.com
            [1] => www.domain2.com
            [2] => www.domain1.com
        )

    [2] => Array
        (
            [0] => value1
            [1] => value2
            [2] => value3
        )

)
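The result above is grouped per capture group, while the question asks for one domain + value entry per URL. One way to get that shape, sketched here with the same pattern, is the PREG_SET_ORDER flag of preg_match_all. The variable names $sets and $pairs and the array keys 'domain' and 'value' are chosen only for illustration:

```php
<?php
$text = "http://www.domain1.com/variable-value1, http://www.domain2.com/variable-value2 http://www.domain1.com/variable-value3";

// PREG_SET_ORDER groups the captures per match instead of per capture group.
preg_match_all('/http:\/\/(.+?)\/variable-([a-z0-9]+)/si', $text, $sets, PREG_SET_ORDER);

$pairs = [];
foreach ($sets as $set) {
    // $set[1] is the host, $set[2] the value after "variable-".
    $pairs[] = ['domain' => $set[1], 'value' => $set[2]];
}
print_r($pairs);
```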