How can I check for and correct invalid HTML in array elements in the following scenario?

Asked: 2015-03-12 12:04:16

Tags: php html arrays dom xml-parsing

I have an array named $comments, as follows:

Array
(
    [0] => Array
        (
            [text] => Second Comment Added                
        )

    [1] => Array
        (
            [text] => This is the long comment added to check thwe size of the comment on the device,if the size is more then add the hyperlink button to go on to the next page
        )

    [2] => Array
        (
            [text] => This comment is of two lines need to check more about it                
        )

    [3] => Array
        (
            [text] => This comment is of two lines need to check more                
        )

    [4] => Array
        (
            [text] => Uploading Photo  for comment <div title="comment_attach_image">

<a title="" title="colorbox" href="https://www.filepicker.io/api/file/CnYTVQdATAOQTkMxpAq4" ><img src="https://www.filepicker.io/api/file/CnYTVQdATAOQTkMxpAq4" height="150px" width="150px" /></a>

<a href="https://www.filepicker.io/api/file/CnYTVQdATAOQTkMxpAq4" class="comment_attach_image_link_dwl">Download</a>

</div>                
        )

    [5] => Array
        (
            [text] => test                
        )

    [6] => Array
        (
            [text] => Amit&#039;s pic<div class="comment_attach_image">
            <a class="group1 cboxElement" href="http://52.1.47.143/file/attachment/2015/03/e55f0f3080eb9828270a7963648a5826.jpeg" ><img src="http://52.1.47.143/file/attachment/2015/03/e55f0f3080eb9828270a7963648a5826.jpeg" height="150px" width="150px" /></a>

            <a class="comment_attach_image_link_dwl"  href="http://52.1.47.143/feed/download/year_2015/month_03/file_e55f0f3080eb9828270a7963648a5826.jpeg" >Download</a>
            </div>
        )

    [7] => Array
        (
            [text] => PDF file added<div class="comment_attach_file">
            <a class="comment_attach_file_link" href="http://52.1.47.143/feed/download/year_2015/month_03/file_1b87d4420c693f2bbdf738cbf2457d89.pdf" >1b87d4420c693f2bbdf738cbf2457d89.pdf</a>

            <a class="comment_attach_file_link_dwl"  href="http://52.1.47.143/feed/download/year_2015/month_03/file_1b87d4420c693f2bbdf738cbf2457d89.pdf" >Download</a>
            </div>                
        )

    [8] => Array
        (
            [text] => Just did it...                
        )

    [9] => Array
        (
            [text] => Akki <div title="comment_attach_image">

<a title="" title="colorbox" href="https://www.filepicker.io/api/file/NJqijbKTIOA0ZJBNknsm" ><img src="https://www.filepicker.io/api/file/NJqijbKTIOA0ZJBNknsm" height="150px" width="150px" /></a>

<a href="https://www.filepicker.io/api/file/NJqijbKTIOA0ZJBNknsm" class="comment_attach_image_link_dwl">Download</a>

</div>                
        )

) 

In this array, two elements contain invalid HTML: $comments[4] and $comments[9]. I consider them invalid because I cannot parse them as XML.
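A minimal sketch of how the failing elements can be identified, assuming the markup always starts at the first <div, as it does in the array above. The helper name isWellFormedFragment is hypothetical, and the sample data is an abbreviated copy of two elements from $comments. A repeated attribute such as the doubled title is a well-formedness error in XML, so simplexml_load_string() returns false for it.

```php
<?php
// Hypothetical helper: true when the markup part of $text parses as XML
// (or when there is no markup at all).
function isWellFormedFragment(string $text): bool
{
    $fragment = strstr($text, '<div');
    if ($fragment === false) {
        return true; // plain-text comment, nothing to parse
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string(trim($fragment));
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml !== false;
}

// Abbreviated sample data; keys match the original array.
$samples = [
    4 => ['text' => 'Uploading Photo  for comment <div title="comment_attach_image">'
        . '<a title="" title="colorbox" href="https://www.filepicker.io/api/file/CnYTVQdATAOQTkMxpAq4">x</a></div>'],
    5 => ['text' => 'test'],
];

foreach ($samples as $key => $comment) {
    echo $key, ': ', isWellFormedFragment($comment['text']) ? 'ok' : 'invalid', PHP_EOL;
}
```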

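After cleanup, I want these two elements to look as follows. The other elements should stay the same, and all array keys should remain intact: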

Array
    (
[4] => Array
            (
                [text] => Uploading Photo  for comment <div class="comment_attach_image">

    <a title="colorbox" href="https://www.filepicker.io/api/file/CnYTVQdATAOQTkMxpAq4" ><img src="https://www.filepicker.io/api/file/CnYTVQdATAOQTkMxpAq4" height="150px" width="150px" /></a>

    <a href="https://www.filepicker.io/api/file/CnYTVQdATAOQTkMxpAq4" class="comment_attach_image_link_dwl">Download</a>

    </div>                
            )  
[9] => Array
            (
                [text] => Akki <div class="comment_attach_image">

    <a title="colorbox" href="https://www.filepicker.io/api/file/NJqijbKTIOA0ZJBNknsm" ><img src="https://www.filepicker.io/api/file/NJqijbKTIOA0ZJBNknsm" height="150px" width="150px" /></a>

    <a href="https://www.filepicker.io/api/file/NJqijbKTIOA0ZJBNknsm" class="comment_attach_image_link_dwl">Download</a>

    </div>                
            )

    ) 

Notice that <div title="comment_attach_image"> has been changed to <div class="comment_attach_image">, and that the extra title attribute with an empty value has been removed.

How can I detect this invalid HTML and correct it in PHP?

Thanks in advance.

Here is my parsing code:

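foreach ($comments as $key => $comment) {
    // Everything from the first <div onward, or false if there is no markup.
    $text = strstr($comment['text'], '<div');
    if (strlen($text) <= 0) {
        // Plain-text comment with no attachment.
        $comments[$key]['type_id'] = 'text';
        $comments[$key]['url'] = '';
        $comments[$key]['text'] = $comment['text'];
    } else if ($xml = @simplexml_load_string($text)) {
        // Well-formed markup: the type is the part of the class after the last underscore.
        $comments[$key]['type_id'] = substr(strrchr($xml['class'], '_'), 1);
        $comments[$key]['url'] = str_replace(array('href=', '"'), '', $xml->a['href']->asXML());
        $comments[$key]['text'] = strtok($comment['text'], '<');
    } else {
        // Markup that fails to parse as XML is skipped.
        continue;
    }
}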

1 answer:

Answer 0: (score: 0)

Try this..

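$original = array('<div title="comment_attach_image">', 'title=""');
$changedText = array('<div class="comment_attach_image">', '');
// $string is the text to fix, e.g. $comments[$key]['text'];
// str_replace() returns the result, so assign it back.
$string = str_replace($original, $changedText, $string);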

It replaces title with class on the div, and replaces title="" with an empty string.
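Applied to the whole $comments array, the replacement could look like the sketch below. The function name fixComments and the shortened sample data are assumptions for illustration. Note that this is a literal string replacement: it only repairs these two exact patterns and does not validate the HTML in general.

```php
<?php
// Hypothetical wrapper: applies the answer's str_replace() to every comment.
function fixComments(array $comments): array
{
    $original    = ['<div title="comment_attach_image">', 'title=""'];
    $changedText = ['<div class="comment_attach_image">', ''];

    foreach ($comments as $key => $comment) {
        // Writing back through the same key leaves all array keys intact.
        $comments[$key]['text'] = str_replace($original, $changedText, $comment['text']);
    }

    return $comments;
}

// Abbreviated sample data; keys match the original array.
$comments = [
    8 => ['text' => 'Just did it...'],
    9 => ['text' => 'Akki <div title="comment_attach_image">'
        . '<a title="" title="colorbox" href="https://www.filepicker.io/api/file/NJqijbKTIOA0ZJBNknsm">x</a></div>'],
];

$comments = fixComments($comments);
print_r($comments);
```

Elements without the two patterns, such as $comments[8] here, pass through unchanged, so the loop can run over every element without checking first.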