Links are lost when converting HTML to XML

Time: 2009-10-24 04:10:13

Tags: php html xml hyperlink

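I am trying to convert an HTML file to XML (an RSS feed). It works for the most part. The problem I am having is with links: right now the converter seems to completely ignore the link in my test file.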

Here is the conversion code:

<?php
ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL);

function convertToXML()
{

    $titleLength = 35;
    $output = "";
    $date = date("D, j M Y G:i:s T");
    $fi = fopen( "../newsTEST.htm", "r" );
    $fo = fopen( "../newsfeed.xml", "w" );

    //This is the first part of the XML (the channel header)
    $output .= "<?xml version=\"1.0\"?>\n";
    $output .= "<rss version=\"2.0\">\n";
    $output .= "<channel>\n";
    $output .= "\t<title>Wiggle 100 News</title>\n";
    $output .= "\t<link>http://www.wiggle100.com/news.php</link>\n";
    $output .= "\t<description>Wiggle 100 Daily News</description>\n";
    $output .= "\t<language>en-us</language>\n";
    $output .= "\t<pubDate>". $date ."</pubDate>\n";
    $output .= "\t<managingEditor>wiggle100@gmail.com</managingEditor>\n";
    $output .= "\t<webMaster>josh@jacurren.com</webMaster>\n";

    $article = "";
    $skip = true; //if false will continue to put lines into output until </p>
    $newArticle = false;

    while( !feof($fi) )
    {
        $line = fgets($fi);
        $link = "";

        if( strpos( $line, "<p" ) !== false)
        {
            $pos = strpos( $line, "<p" );
            $line = substr( $line, $pos );

            $pos = strpos( $line, ">" );
            $line = substr( $line, $pos + 1 );

            $skip = false;          
        }

        if( strpos( $line, "</p>" ) !== false )
        {
            $pos = strpos( $line, "</p>" );
            $line = substr( $line, 0, $pos - 1 );

            $newArticle = true;
        }

        //This adds the line to the article
        if( !$skip )
        {
            $article .= $line;
        }

        //This mixes the article, title, link, and date with 
        // XML and puts it into the output
        if( $newArticle )
        {
            //This if is to get rid of stuff like <p>&nbsp;</p>
            if( (strlen($article) > 10) )
            {
                $link = findLink( $article );
                //$article = strip_tags($article);
                $title = substr( $article, 0, $titleLength ) . "...";

                $output .= "\t<item>\n";
                $output .= "\t\t<title>". $title ."</title>\n";
                $output .= "\t\t<link>". $link ."</link>\n";
                $output .= "\t\t<description>". $article . "</description>\n";
                $output .= "\t\t<pubDate>". $date . "</pubDate>\n";
                $output .= "\t</item>\n\n";
            }

            $article = "";
            $line = "";
            $skip = true;
        }
    }

    $output .= "</channel>\n";
    $output .= "</rss>\n";

    fwrite( $fo, $output );

    fclose($fi);
    fclose($fo);

    echo "<br /><br /> News converted to XML";
}

    //*****************************************************************************
    //*****************************************************************************

    //Find and return a link in the input.
    //Otherwise return the default link.
    function findLink( $input )
    {   
        $link = "http://www.wiggle100.com/news.php";

        if( strpos( $input, "<a" ) !== false )
        {
            $startpos = strpos( $input, "href" );
            $link = substr( $input, $startpos + 5 );
            $endpos = strpos( $link, ">" );
            $link = substr( $link, 0, $endpos - 2 );
        }
        return $link;
    }


?>

Here is the HTML test file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> 
<a href="http://www.thedailyreview.com/news/"> 
http://www.thedailyreview.com/news/</a></p> 
</body> 
</html> 

Here is the XML output:

<rss version="2.0"> 
<channel> 
    <title>Wiggle 100 News</title> 
    <link>http://www.wiggle100.com/news.php</link> 
    <description>Wiggle 100 Daily News</description> 
    <language>en-us</language> 
    <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    <managingEditor>wiggle100@gmail.com</managingEditor> 
    <webMaster>josh@jacurren.com</webMaster> 
    <item> 
        <title>This is an article. Blah. Blah. Bla...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title>This is another article. Blah. Blah...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title>This is the 3rd article. Blah. Blah...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
        <title><font size="6">This is the news for...</title> 
        <link>http://www.wiggle100.com/news.php</link> 
        <description><font size="6">This is the news for today. Blah Blah Blah!</font> 
</description> 
        <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

</channel> 
</rss> 

When I uncomment the strip_tags() call, the font tags go away.

2 answers:

Answer 0 (score: 1)

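I ran some tests and found that it works fine for paragraphs that sit on a single line in the input file, as in the example below. (Except that it includes the opening quotation mark as part of the URL, but that is easy to fix.)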

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> <a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a></p> 
</body> 
</html>
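
The quotation-mark issue mentioned above could be addressed along the following lines. This is only a sketch, not the asker's code: findLinkQuoted is a hypothetical name, and it assumes the href value is always wrapped in single or double quotes. It swaps the fixed substr() offsets in findLink() for a regular expression that captures only the text between the quotes.

```php
<?php
// Hypothetical replacement for findLink(): returns the value of the first
// quoted href attribute on an <a> tag in $input, or $default when none is found.
function findLinkQuoted($input, $default = "http://www.wiggle100.com/news.php")
{
    // (["\']) captures the opening quote, (.*?) captures the URL,
    // and \1 requires the same quote character to close it.
    if (preg_match('/<a\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/is', $input, $m)) {
        return $m[2];
    }
    return $default;
}

$article = '<font size="6">This is the news for today.</font> '
         . '<a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a>';
echo findLinkQuoted($article), "\n";
```

Because the capture group stops at the closing quote, neither quotation mark ends up in the returned URL, and the result does not depend on how far the closing > sits from the end of the attribute.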

Answer 1 (score: 0)

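The problem turned out to be that I never reset $newArticle to false after writing an item to the XML output. As a result, once $newArticle had been set to true (that is, once the first </p> was found), every following line was treated as the end of an article, so no article could ever span more than one line before being output. With $newArticle set back to false after the output is written, the program correctly keeps appending lines to the article until it reaches </p>.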
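
A minimal sketch of the corrected loop, under some assumptions: collectParagraphs is a hypothetical helper that takes the input as an array of lines and returns the collected paragraphs instead of writing RSS items, so that the flag handling can be seen in isolation. It also cuts at $pos rather than $pos - 1 so the last character before </p> is kept, which is a deviation from the code in the question.

```php
<?php
// Hypothetical helper mirroring the question's loop: groups input lines into
// paragraphs and resets $newArticle after each paragraph is emitted.
function collectParagraphs(array $lines)
{
    $articles = array();
    $article = "";
    $skip = true;        // true while we are outside a <p> ... </p> block
    $newArticle = false; // true once the closing </p> has been seen

    foreach ($lines as $line) {
        $pos = strpos($line, "<p");
        if ($pos !== false) {
            // Drop everything up to and including the end of the opening <p ...> tag.
            $line = substr($line, $pos);
            $line = substr($line, strpos($line, ">") + 1);
            $skip = false;
        }

        $pos = strpos($line, "</p>");
        if ($pos !== false) {
            $line = substr($line, 0, $pos);
            $newArticle = true;
        }

        if (!$skip) {
            $article .= $line;
        }

        if ($newArticle) {
            // Ignore near-empty paragraphs such as <p>&nbsp;</p>.
            if (strlen($article) > 10) {
                $articles[] = $article;
            }
            $article = "";
            $skip = true;
            $newArticle = false; // the reset that was missing
        }
    }
    return $articles;
}
```

With the reset in place, the multi-line paragraph from the test file is collected as a single article that still contains its <a href=...> tag, so the link lookup has something to find.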