How can I make this PHP program faster?

Asked: 2011-07-18 23:25:24

Tags: php performance

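I built a program that loops through a list of words and fetches their synonyms from www.dicsin.com.br, but it takes forever (literally), because my testfile.txt has 307k words. What should I do? Please give me some advice. Could I make it multi-process or multi-threaded? I don't know; I'm new to PHP and to programming. Thanks anyway. By the way, here is my complete working code: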

<?
//Gets words from the site: www.dicsin.com.br
pegarSinonimos("http://www.dicsin.com.br/content/dicsin_lista.php");

function pegaPalavras()
{
return file('testfile.txt');
}

function pegarSinonimos($url)
{
        $dicionario = pegaPalavras();
        $array_palavras = array();
        $array_palavras2 = array();
        $con = mysql_connect("localhost","root","whatever");
        if (!$con)
         {
          die('Could not connect: ' . mysql_error());
         }
        mysql_select_db("palavras2", $con);
        foreach($dicionario as $palavra)
        {
            $url_final = $url . "?f_pesq=" . $palavra;// . "&pagina=" . $pagina;

            $html = file_get_contents($url_final);

            $dom = new DOMDocument();
            $dom->loadHTML($html);

            $xpath = new DOMXPath($dom);
            $tags = $xpath->query('//div[@class="palavras_encontradas"]/div[@class="box_palavras_encontradas"]');
            foreach ($tags as $tag) 
            {
                $bla = $tag->nodeValue;
                $bla = utf8_decode($bla);
                $bla = str_replace("visualizar palavras", "", $bla);
                $bla = str_replace("(Sinônimo) ", "", $bla);//echo $bla;//array_push($array_palavras,$tag->nodeValue);
                $sql = "CREATE TABLE $palavra(sinonimo varchar(29))";
                mysql_query($sql,$con);
                mysql_query("INSERT INTO $palavra (sinonimo) VALUES ('$bla')");
            }
        }
        mysql_close($con);
}   
?>

4 answers:

Answer 0 (score: 2)

Build a hash table and do your lookups against it. That gives you O(1) constant-time lookups.
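
As an illustration of the idea: PHP arrays are hash tables, so keying an array by word gives constant-time membership checks on average. This is a minimal sketch, assuming the goal is to skip duplicate or already-processed words; `buildWordSet` and the sample words are made up for this example.

```php
<?php
// Hypothetical helper: turn a list of words into a hash set keyed by word.
function buildWordSet(array $words)
{
    $set = array();
    foreach ($words as $word) {
        $word = trim($word);        // file() keeps the trailing newline on each line
        if ($word !== '') {
            $set[$word] = true;     // the key is hashed, so later lookups are O(1) on average
        }
    }
    return $set;
}

// Sample input with a duplicate, in the shape file('testfile.txt') would return.
$done = buildWordSet(array("casa\n", "carro\n", "casa\n"));

// isset() on a key is a hash lookup; in_array() would scan the whole list instead.
var_dump(isset($done['casa']));    // bool(true)
var_dump(isset($done['barco']));   // bool(false)
```

This only speeds up lookups within the word list; it does nothing for the time spent waiting on the remote site.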

Answer 1 (score: 0)

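If you want to run it in parallel, you can compile PHP with --enable-pcntl

and fork processes using the PCNTL functions. Note that this gives you multiple processes rather than threads.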
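
A rough sketch of what forking could look like, assuming a CLI script and the pcntl extension. `processChunk` and `forkWorkers` are hypothetical names, and the actual fetch/parse/insert work is left as a placeholder:

```php
<?php
// Hypothetical stand-in for the real work (fetch, parse and store each word).
function processChunk(array $words)
{
    foreach ($words as $word) {
        // fetch and store the synonyms for $word here
    }
}

// Split the word list into roughly equal chunks and fork one child per chunk.
function forkWorkers(array $words, $workers)
{
    $chunkSize = max(1, (int) ceil(count($words) / $workers));
    $pids = array();

    foreach (array_chunk($words, $chunkSize) as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("Could not fork\n");
        }
        if ($pid === 0) {
            // Child: open its own DB connection here, do the work, then exit.
            processChunk($chunk);
            exit(0);
        }
        $pids[] = $pid;   // Parent: remember the child and keep forking.
    }

    // Parent: wait for every child so none is left behind as a zombie.
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
    return count($pids);
}

// Only run when pcntl is actually available (it is normally CLI-only).
if (function_exists('pcntl_fork')) {
    forkWorkers(array('casa', 'carro', 'barco'), 2);
}
```

Each child should open its own MySQL connection after the fork; sharing the parent's connection across processes tends to cause trouble.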

Answer 2 (score: 0)

Complexity

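As FinalForm said, the complexity of your algorithm is too high (O(n^2)). You should avoid loops inside loops (let alone inside yet another loop). You should always work out the complexity of your algorithm, although doing that rigorously is hard.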

Low-hanging fruit

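To optimize the slow parts, go after only your low-hanging fruit, using a tool such as xdebug / callgrind to find it. I recommend watching the talk "simple is hard" by Rasmus, the creator of PHP, to learn this concept. Tackling the low-hanging fruit is where you get the biggest gains.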
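
Before setting up a real profiler, a cruder check is to time each step by hand. The sketch below is only an illustration: `timeIt` is a hypothetical helper, and the two callbacks are stand-ins for the real fetch and parse steps.

```php
<?php
// Hypothetical helper: run a callback, print how long it took, return its result.
function timeIt($label, $callback)
{
    $start = microtime(true);
    $result = call_user_func($callback);
    printf("%s: %.3f s\n", $label, microtime(true) - $start);
    return $result;
}

// Stand-in for file_get_contents($url_final); the sleep imitates network latency.
$html = timeIt('fetch', function () {
    usleep(20000);
    return '<div class="x">casa</div>';
});

// Stand-in for the DOMDocument / DOMXPath parsing step.
$length = timeIt('parse', function () use ($html) {
    return strlen(strip_tags($html));
});
```

If the fetch line dominates the output, the network is the low-hanging fruit and the parsing code is not worth tuning yet.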

Curl_multi

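I think the really slow part is that you fetch each URL in a blocking way (nothing else can happen while you wait). I don't think the other loops take nearly as much time; compared with them, fetching from a remote host is your low-hanging fruit. You can use curl_multi to fetch your URLs concurrently => http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/. This should be a lot faster than the blocking file_get_contents.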
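
A minimal sketch of the curl_multi approach, assuming the cURL extension is installed. `fetchAll` is a hypothetical helper, not something from the linked article; it fetches one batch of URLs concurrently and returns the bodies keyed like the input array.

```php
<?php
// Hypothetical helper: download all $urls in parallel with the curl_multi_* functions.
function fetchAll(array $urls, $timeout = 30)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        do {
            $status = curl_multi_exec($multi, $running);
        } while ($status === CURLM_CALL_MULTI_PERFORM);   // only seen with old libcurl versions

        // Sleep until there is socket activity; back off briefly if select() fails.
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = array();
    foreach ($handles as $key => $ch) {
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $bodies;
}

// Hypothetical helper: build the lookup URLs for one batch of words, keyed by word.
function buildUrls(array $words, $base)
{
    $urls = array();
    foreach ($words as $word) {
        $urls[$word] = $base . '?f_pesq=' . urlencode(trim($word));
    }
    return $urls;
}
```

With 307k words you would not add every URL at once. Something like `foreach (array_chunk($words, 10) as $batch) { $pages = fetchAll(buildUrls($batch, $url)); }` keeps only a handful of requests in flight; the batch size of 10 is an arbitrary choice here, and it is worth being gentle with the remote server.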

Message queue (MQ)

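This may not be available (or welcome) on shared hosting. But to make your site really fast, you should process your load offline with an MQ (for example Redis or beanstalkd). Each individual task is then handled offline through the message queue, and the pieces are sent back as they finish.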

Answer 3 (score: 0)

If you use the latest NaturePHP library for PHP 5.3+ and have cURL installed, this should give you a huge boost:

<?php
include('nphp/init.php');

function pegaPalavras()
{
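//drop the trailing newlines and blank lines so they don't end up in urls or table names
return file('testfile.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);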
}

//Gets words from the site: www.dicsin.com.br
pegarSinonimos("http://www.dicsin.com.br/content/dicsin_lista.php");

function pegarSinonimos($url)
{
$dictionary = pegaPalavras();
$files = array();

$con = mysql_connect("localhost","root","blablabla");   //omg! a root pwd :O
if (!$con)
{
    die('Could not connect: ' . mysql_error());
}
mysql_select_db("palavras2", $con);


foreach($dictionary as $palavra)
{
    $files[] = $url . "?f_pesq=" . urlencode(trim($palavra));// . "&pagina=" . $pagina;
}

//Http::multi_getcontents on NaturePHP makes use of curl_multi for parallel processing and
//fires callbacks asap, first come first served style
//it will, however, use a lot of CPU and bandwidth while processing
Http::multi_getcontents(

//uris
$files,

//process callback
function($url, $content) use ($con){

    //the word is whatever follows "f_pesq=" in the requested url
    list(, $word) = explode('=', $url, 2);
    $word = trim(urldecode($word));

    $dom = new DOMDocument();
    @$dom->loadHTML($content);   //@ silences warnings about malformed html

    $xpath = new DOMXPath($dom);
    $tags = $xpath->query('//div[@class="palavras_encontradas"]/div[@class="box_palavras_encontradas"]');

    //you create one table per word
    $sql = "CREATE TABLE `$word`(sinonimo varchar(29))";
    mysql_query($sql,$con);

    //if something was found
    if($tags->length>0){

        //get an array with synonyms
        $synonyms=array();
        foreach ($tags as $tag) 
        {
            $synonyms[] = utf8_decode($tag->nodeValue);
        }

        //you can use str_replace on arrays, it's faster
        $synonyms = str_replace("visualizar palavras", "", $synonyms);
        $synonyms = str_replace("(Sinônimo) ", "", $synonyms);

        //escape the values, then a single insert query with all of them is much faster
        $synonyms = array_map('mysql_real_escape_string', $synonyms);
        $values = "('" . implode("'), ('", $synonyms) . "')";
        mysql_query("INSERT INTO `$word` (sinonimo) VALUES $values", $con);

    }


});

mysql_close($con);
}

?>

I haven't really tested this code, so there may be minor errors, but you get the general idea ;)

If you don't have PHP 5.3+, you can look at the NaturePHP source code to see how to use curl_multi.

PS: you might want to change your root password :x