Removing HTML escape sequences from a string with PHP

Asked: 2014-04-16 09:02:10

Tags: php string

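I'm working on a PHP script that pulls quest data from Wowhead, specifically what starts and ends each quest (an item or an NPC) and its ID or name, respectively. This is the relevant part of the whole script; the rest deals with the database inserts. In case anyone is interested, here is the complete snippet I came up with. Also, given that this will run about 15,000 times, is this the best way to fetch and store the data?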

<?php

$quests = array();
//$questlimit = 14987;
$questlimit = 5;
$currentquest = 1;
$questsprocessed = 0;
while($questsprocessed != $questlimit)
{
echo "<br>";
echo "  Start of iteration: ".$questsprocessed."  ";
echo "<br>";
echo "  Attempting to process quest: ".$currentquest."  ";
echo "<br>";

$quests[$currentquest] = array();
$baseurl = 'http://wowhead.com/quest=';
$fullurl = $baseurl.$currentquest;

$data = drupal_http_request($fullurl);

$queststartloc1 = strpos($data->data, 'quest_start'); 
$queststartloc2 = strpos($data->data, 'quest_end');

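// strpos() returns false when the marker is missing; compare strictly so that position 0 is not mistaken for "not found".
if($queststartloc1===false)
{$currentquest++; echo "No data for this quest"; echo "<br>"; continue;}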


$questendloc1 = strpos($data->data, 'quest_end');
$questendloc2 = strpos($data->data, 'x5DDifficulty');

$startcaptureLength = $queststartloc2 - $queststartloc1;
$endcaptureLength = $questendloc2 - $questendloc1;


$quest_start_raw = substr($data->data,$queststartloc1, $startcaptureLength);
$quest_end_raw = substr($data->data, $questendloc1, $endcaptureLength);

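// The /e modifier was removed in PHP 7; preg_replace_callback does the same job.
$decodeHex = function ($m) { return chr(hexdec($m[1])); };
$startDecoded = preg_replace_callback('~\\\\x([A-Fa-f0-9]{2})~', $decodeHex, $quest_start_raw);
$endDecoded = preg_replace_callback('~\\\\x([A-Fa-f0-9]{2})~', $decodeHex, $quest_end_raw);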
$quests[$currentquest]['Start'] = array();
$quests[$currentquest]['End'] = array();

if(strstr($startDecoded, 'npc'))
  {
   $quests[$currentquest]['Start']['Type'] = "npc";
  preg_match('~npc=(\d+)~', $startDecoded, $startmatch);
  }
else
{
  $quests[$currentquest]['Start']['Type'] = "item";
  preg_match('~item=(\d+)~', $startDecoded, $startmatch);
}


$quests[$currentquest]['Start']['ID'] = $startmatch[1];


if(strstr($endDecoded, 'npc'))
  {
   $quests[$currentquest]['End']['Type'] = "npc";
  preg_match('~npc=(\d+)~', $endDecoded, $endmatch);
  }
else
{
  $quests[$currentquest]['End']['Type'] = "item";
  preg_match('~item=(\d+)~', $endDecoded, $endmatch);
}


$quests[$currentquest]['End']['ID'] = $endmatch[1];

//var_dump($quests[$currentquest]);

echo "  End of iteration: ".$questsprocessed."  ";
echo "<br>";
echo "  Processed quest: ".$currentquest."  ";
echo "<br>";
$currentquest++;
$questsprocessed++;

}
?>

1 answer:

Answer 0 (score: 3)

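These are called "escape sequences". They are usually used to represent non-printable characters, but they can encode any character. In PHP, you can decode them like this: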

$text = '
quest_start\\x5DStart\\x3A\\x20\\x5Bitem\\x3D16305\\x5D\\x5B\\x2Ficon\\x5D\\x5B\\x2Fli\\x5D\\x5Bli\\x5D\\x5Bicon\\x20name\\x3Dquest_end\\x5DEnd\\x3A\\x20\\x5Burl\\x3D\\x2Fnpc\\x3D12696\\x5DSenani\\x20Thunderheart\\x5B\\x2Furl\\x5D\\x5B\\x2Ficon\\x5D\\x5B\\x2Fli\\x5D\\x5Bli\\x5DNot\\x20sharable\\x5B\\x2Fli\\x5D\\x5Bli
';

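// The /e modifier was removed in PHP 7, so a callback converts each pair of hex digits to its character.
$decoded = preg_replace_callback('~\\\\x([A-Fa-f0-9]{2})~', function ($m) { return chr(hexdec($m[1])); }, $text);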

That gives you a string similar to this:

 quest_start]Start: [item=16305][/icon][/li][li][icon name=quest_end]End: [url=/npc=12696]Senani Thunderheart[/url][/icon][/li][li]Not sharable[/li][li
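
On PHP 7 and later the `e` pattern modifier is no longer available, so the decoding step could be wrapped in a small helper built on `preg_replace_callback`. This is only a sketch: the function name `decodeHexEscapes` and the shortened sample string are assumptions for illustration and do not come from the original answer.

```php
<?php
// Hypothetical helper (the name is illustrative): decode literal \xHH sequences.
function decodeHexEscapes(string $text): string
{
    return preg_replace_callback(
        '~\\\\x([A-Fa-f0-9]{2})~',
        function (array $m): string {
            // $m[1] holds the two hex digits; convert them to the matching byte.
            return chr(hexdec($m[1]));
        },
        $text
    );
}

// Shortened version of the sample above; in a single-quoted PHP string "\\" is one backslash.
$sample = 'quest_start\\x5DStart\\x3A\\x20\\x5Bitem\\x3D16305\\x5D';
echo decodeHexEscapes($sample), "\n";
```

For the shortened sample this prints `quest_start]Start: [item=16305]`.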

(Apparently this is some kind of BBCode.) To remove all the BBCode tags, one more replacement is needed:

$clean = preg_replace('~(\[.+?\])+~', ' ', $decoded);
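
Since the question only needs the type (npc or item) and the ID of whatever starts and ends the quest, the decoded string could also be parsed directly instead of stripping the BBCode first. The following is a rough sketch that assumes the markup always looks like the decoded sample above; the function name `extractQuestEndpoint` is hypothetical.

```php
<?php
// Hypothetical helper: find the first "npc=ID" or "item=ID" that follows a
// marker such as "quest_start" or "quest_end" in the already-decoded text.
function extractQuestEndpoint(string $decoded, string $marker): ?array
{
    $pattern = '~' . preg_quote($marker, '~') . '.*?\b(npc|item)=(\d+)~s';
    if (preg_match($pattern, $decoded, $m) !== 1) {
        return null; // marker or ID not present
    }
    return ['Type' => $m[1], 'ID' => (int) $m[2]];
}

$decoded = 'quest_start]Start: [item=16305][/icon][/li][li][icon name=quest_end]End: [url=/npc=12696]Senani Thunderheart[/url][/icon][/li]';

var_dump(extractQuestEndpoint($decoded, 'quest_start')); // Type "item", ID 16305
var_dump(extractQuestEndpoint($decoded, 'quest_end'));   // Type "npc", ID 12696
```

One caveat of this sketch: if the `quest_start` part contains neither an npc nor an item reference, the lazy match runs on into the `quest_end` part, so it may be safer to cut the string at `quest_end` first, as the question's `substr` calls do.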