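Extracting the v value from anchors nested inside div tags

Date: 2012-01-23 07:07:26

Tags: php html

From an HTML page I need to extract the value of v from every anchor link... each anchor is nested inside about 5 div tags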

<a href="/watch?v=value to be retrived&amp;list=blabla&amp;feature=plpp_play_all">

Each v value is 11 characters long, so for now I am trying to read it character by character, like this

<?php
$file = fopen("xx.html", "r") or exit("Unable to open file!");
$d = 'v';
$dd = '=';
$vd = array();
while (!feof($file))
{
    $f = fgetc($file);
    if ($f == $d)
    {
        $ff = fgetc($file);
        if ($ff == $dd)
        {
            // Found "v=": collect the next 11 characters
            $id = '';
            for ($i = 0; $i <= 10; $i++)
            {
                $sData = fgetc($file);
                $id = $id . $sData;
            }
            array_push($vd, $id);
        }
    }
}
fclose($file);

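This reads each character of v into $sData and appends it to $id, so that the 11 characters end up as one string ($id)... The problem is that scanning the whole HTML file for 'v=', then reading 11 characters and pushing them into the array, is painfully slow... so please help me do this in a more sophisticated way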

2 Answers:

Answer 0: (score: 2)

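<?php
// Returns the text between the first $start and the next $end.
// $string is passed by reference and is advanced past the match,
// so repeated calls walk through the document.
function substring(&$string, $start, $end)
{
    $pos = strpos($string, $start);
    if ($pos === false) return "";
    $string = substr($string, $pos + strlen($start));
    $posend = strpos($string, $end);
    if ($posend === false) return ""; // no terminator after $start
    $toret = substr($string, 0, $posend);
    $string = substr($string, $posend);
    return $toret;
}
$contents = file_get_contents("xx.html");
if ($contents === false) exit("Unable to open file!");

$old = "";
$videosArray = array();
// Stop once a call no longer consumes any input
while ($old !== $contents)
{
    $old = $contents;
    $v = substring($contents, "?v=", "&");
    if ($v !== "") $videosArray[] = $v;
}

// $videosArray is the array of v values
?>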
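
The same "find ?v= and read up to the delimiter" idea can also be written as a single regular expression. What follows is only a sketch, not part of the answer above: the sample markup is made up, and the pattern assumes each v value is exactly 11 characters drawn from letters, digits, `_` and `-` (the question only states the length, so adjust the character class if that assumption does not hold).

```php
<?php
// Hypothetical sample markup standing in for the contents of xx.html.
$contents = '<div><div><a href="/watch?v=abcdefghijk&amp;list=PL1&amp;feature=plpp_play_all">one</a></div></div>'
          . '<div><a href="/watch?v=ABCDEFGH_-1&amp;list=PL1">two</a></div>';

// Match "v=" when it directly follows "?", "&" or the ";" that ends "&amp;",
// then capture the next 11 characters (assumed alphabet: A-Z a-z 0-9 _ -).
preg_match_all('/[?&;]v=([A-Za-z0-9_-]{11})/', $contents, $matches);

// $matches[1] holds the captured group of every match, in document order.
$videosArray = $matches[1];
print_r($videosArray);
```

Because preg_match_all scans the string once in compiled code, it avoids both the per-character fgetc calls from the question and the repeated substr copies in the loop above.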

Answer 1: (score: 0)

I would rather parse the HTML with SimpleXML and XPath:

// Get your page HTML string
$html = file_get_contents('xx.html');

// As per comment by Gordon to suppress invalid markup warnings
libxml_use_internal_errors(true);

// Create SimpleXML object
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

// Find <a> nodes whose href contains "v="
$anchors = $xml->xpath('//a[contains(@href, "v=")]');

$vd = array();
foreach ($anchors as $a)
{
    $href = (string)$a['href'];
    $query = parse_url($href, PHP_URL_QUERY);
    if (!$query) continue; // href has no query string

    parse_str($query, $params);

    // $params['v'] contains what we need
    if (isset($params['v'])) $vd[] = $params['v']; // push into array
}

// Clear invalid markup error buffer
libxml_clear_errors();
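
For comparison, the same lookup can stay entirely in the DOM extension and query with DOMXPath instead of converting to SimpleXML. This is a rough, self-contained sketch: the inline markup is invented for illustration (the second link is there only to show a href that contains "v=" but has no query string), and in practice $html would come from xx.html as above.

```php
<?php
// Hypothetical sample markup; in the thread this would be read from xx.html.
$html = '<html><body><div><div><div>'
      . '<a href="/watch?v=abcdefghijk&amp;list=PL1&amp;feature=plpp_play_all">one</a>'
      . '<a href="/page/v=1">no query string</a>'
      . '</div></div></div></body></html>';

// Collect markup warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$vd = array();

foreach ($xpath->query('//a[contains(@href, "v=")]') as $a) {
    // getAttribute returns the decoded value, so "&amp;" is already "&".
    $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
    if (!$query) {
        continue; // "v=" appeared in the path, not in a query string
    }
    parse_str($query, $params);
    if (isset($params['v'])) {
        $vd[] = $params['v'];
    }
}

print_r($vd);
```

Letting parse_url and parse_str split the href means the v value is found regardless of where it sits among the other parameters or how long it is.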