I have a function that scrapes data from a web page. I pick the tag whose data should be scraped, and I get a result. function.php looks like this:
<meta http-equiv="Content-Type" content="text/HTML; charset=utf-8" />
<?php
function LoadCURLPage($url, $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4 Gecko/20030624 Netscape/7.1 (ax)",
$cookie = '', $referer = '', $post_fields = '', $return_transfer = 1,
$follow_location = 1, $ssl = '', $curlopt_header = 0)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
if($ssl)
{
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
}
curl_setopt ($ch, CURLOPT_HEADER, $curlopt_header);
if($agent)
{
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
}
if($post_fields)
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
}
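// Honor the $return_transfer and $follow_location parameters instead of ignoring them
curl_setopt($ch, CURLOPT_RETURNTRANSFER, $return_transfer);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $follow_location);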
if($referer)
{
curl_setopt($ch, CURLOPT_REFERER, $referer);
}
if($cookie)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
}
$result = curl_exec ($ch);
curl_close ($ch);
return $result;
}
function extract_unit($string, $start, $end)
{
$pos = stripos($string, $start);
$str = substr($string, $pos);
$str_two = substr($str, strlen($start));
$second_pos = stripos($str_two, $end);
$str_three = substr($str_two, 0, $second_pos);
$unit = trim($str_three); // remove whitespaces
return $unit;
}
?>
And process.php looks like this:
<?php
error_reporting (E_ALL ^ E_NOTICE);
include 'function.php';
// Connect to this url using CURL
$url1 = 'http://www.remixon.com.tr/remixon.xml';
// Let's use cURL to connect to the URL
$data1 = LoadCURLPage($url1);
// Extract information between STRING 1 & STRING 2
$string_one1 = '<SatisFiyati>';
$string_two1 = '</SatisFiyati>';
$info1 = extract_unit($data1, $string_one1, $string_two1);
$info1 = duzenL($info1);
echo $info1;
?>
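This process.php echoes only the data scraped from the first tag. But the URL contains 30 identical tags, and I need to scrape all of them.
How can I retrieve the data between every pair of identical "<SatisFiyati>" and "</SatisFiyati>" tags at a single URL?
Answer 0: (score: 1)
Use DOMDocument to load the XML from the remote site instead of processing the raw text. You can then extract all elements by tag name, along the lines of your example: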
<?php
include 'function.php';
// Connect to this url using CURL
$url1 = 'http://www.remixon.com.tr/remixon.xml';
$data1 = LoadCURLPage($url1);
$dom = new DOMDocument;
$dom->loadXML($data1);
$items = $dom->getElementsByTagName('SatisFiyati');
foreach ($items as $item) {
// do something with the data here
echo $item->nodeValue, PHP_EOL;
}
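To try the DOMDocument approach without hitting the network, here is a minimal sketch that parses an inline XML string. The sample XML, its `Urunler`/`Urun` wrapper elements, and the helper name `extract_all_by_tag` are assumptions made for illustration; the layout of the real feed may differ.

```php
<?php
// Hypothetical helper (not from the original post): returns the trimmed text
// content of every element named $tag found in the XML string $xml.
function extract_all_by_tag($xml, $tag)
{
    // Collect libxml parse errors instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $values = array();

    // Guard against false or an empty string, e.g. when the download failed
    if (is_string($xml) && $xml !== '' && $dom->loadXML($xml)) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $values[] = trim($node->nodeValue);
        }
    }

    // Restore the previous libxml error handling
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $values;
}

// Assumed sample data; the real feed's structure may differ
$sample = '<Urunler>'
    . '<Urun><SatisFiyati> 10.50 </SatisFiyati></Urun>'
    . '<Urun><SatisFiyati>24.90</SatisFiyati></Urun>'
    . '</Urunler>';

$prices = extract_all_by_tag($sample, 'SatisFiyati');
print_r($prices);
```

In process.php the same helper could presumably be called as `extract_all_by_tag(LoadCURLPage($url1), 'SatisFiyati')`; it returns an empty array when the download or the parse fails.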
Answer 1: (score: 0)
You can use preg_match_all(), which returns all matches of a regular expression.
http://php.net/manual/en/function.preg-match-all.php
In your case, your extract_unit() function would look something like this:
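function extract_unit($string, $start, $end)
{
// Escape both tags so the "/" in the closing tag does not end the pattern early
$pattern = "/" . preg_quote($start, "/") . "([^<]*)" . preg_quote($end, "/") . "/";
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
return $matches[1];
}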
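$matches[0] contains an array of the strings that matched the full pattern, and $matches[1] contains an array of the strings enclosed by the tags. So $matches[1] is the one you actually need.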
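One detail worth checking with the regex approach: the closing tag contains a `/`, which is also the pattern delimiter here, so the tag strings should go through `preg_quote()` before they are placed in the pattern. The sketch below runs on an assumed sample string, and the function name `extract_units` is hypothetical, chosen so it does not clash with the existing `extract_unit()`.

```php
<?php
// Hypothetical variant of extract_unit() that returns every match
function extract_units($string, $start, $end)
{
    // preg_quote() escapes regex metacharacters and the "/" delimiter
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/is';

    // preg_match_all() returns 0 when nothing matches and false on error
    if (!preg_match_all($pattern, $string, $matches)) {
        return array();
    }

    // $matches[1] holds the text captured between the two tags
    return array_map('trim', $matches[1]);
}

// Assumed sample data for illustration
$data = '<SatisFiyati>10.50</SatisFiyati><Ad>x</Ad><SatisFiyati> 24.90 </SatisFiyati>';
$units = extract_units($data, '<SatisFiyati>', '</SatisFiyati>');
print_r($units);
```

The `i` modifier keeps the case-insensitive behavior of the original `stripos()` calls, and the non-greedy `(.*?)` with the `s` modifier lets a value span several lines. For anything beyond flat, non-nested tags, an XML parser is likely the more robust choice.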