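I built a program that loops over a list of words and fetches their synonyms from www.dicsin.com.br, but it takes a very long time (literally), because my testfile.txt contains 307k words. What should I do? Please give me some advice. Could I make it multi-process or multi-threaded? I don't know; I'm new to PHP and to programming. Anyway, thanks. By the way, here is my complete working code: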
<?php
// Fetch words from the site: www.dicsin.com.br
pegarSinonimos("http://www.dicsin.com.br/content/dicsin_lista.php");

function pegaPalavras()
{
    return file('testfile.txt');
}

function pegarSinonimos($url)
{
    $dicionario = pegaPalavras();
    $array_palavras = array();
    $array_palavras2 = array();
    $con = mysql_connect("localhost","root","whatever");
    if (!$con)
    {
        die('Could not connect: ' . mysql_error());
    }
    mysql_select_db("palavras2", $con);
    foreach($dicionario as $palavra)
    {
        $url_final = $url . "?f_pesq=" . $palavra;// . "&pagina=" . $pagina;
        $html = file_get_contents($url_final);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        $xpath = new DOMXPath($dom);
        $tags = $xpath->query('//div[@class="palavras_encontradas"]/div[@class="box_palavras_encontradas"]');
        foreach ($tags as $tag)
        {
            $bla = $tag->nodeValue;
            $bla = utf8_decode($bla);
            $bla = str_replace("visualizar palavras", "", $bla);
            $bla = str_replace("(Sinônimo) ", "", $bla);//echo $bla;//array_push($array_palavras,$tag->nodeValue);
            $sql = "CREATE TABLE $palavra(sinonimo varchar(29))";
            mysql_query($sql,$con);
            mysql_query("INSERT INTO $palavra (sinonimo) VALUES ('$bla')");
        }
    }
    mysql_close($con);
}
?>
Answer 0 (score: 2)
Build a hash table and do your lookups against it. Each lookup then takes O(1) (constant) time on average.
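In PHP an ordinary associative array is already a hash table, so one way to read this advice is to key an array by word and test membership with isset(). The sketch below is only an illustration: buildWordIndex() and the sample words are assumptions, not part of the original code, and it speeds up in-memory lookups only, not the time spent on HTTP requests.

```php
<?php
// Build a hash table (associative array) keyed by word.
// buildWordIndex() is a hypothetical helper, not from the original post.
function buildWordIndex(array $words)
{
    $index = array();
    foreach ($words as $word) {
        $word = trim($word);       // file() keeps the trailing newline on each line
        if ($word !== '') {
            $index[$word] = true;  // only the key matters; duplicates collapse
        }
    }
    return $index;
}

// Sample data standing in for file('testfile.txt').
$index = buildWordIndex(array("casa\n", "carro\n", "casa\n"));

// isset() on an array key is a hash lookup, O(1) on average.
var_dump(isset($index['casa']));   // bool(true)
var_dump(isset($index['barco']));  // bool(false)
```

Using the word as the key rather than the value is the design choice that matters here: in_array() would scan the whole list on every call.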
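Answer 1 (score: 0)
If you want to run it in parallel, you can build PHP with --enable-pcntl. Note that this gives you process forking (pcntl_fork()) rather than real threads.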
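A minimal sketch of what forking could look like, assuming the pcntl extension is available (it works only from the command line and not on Windows). runInParallel() and the $worker callback are hypothetical names introduced for illustration; the worker would contain the fetch-and-store logic for a single word.

```php
<?php
// Split the word list across several child processes with pcntl_fork().
// runInParallel() is a hypothetical helper, not from the original post.
function runInParallel(array $words, $workers, $worker)
{
    if (count($words) === 0) {
        return 0;
    }
    // One slice of the word list per child process.
    $chunks = array_chunk($words, (int) ceil(count($words) / $workers));
    $pids = array();
    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("Could not fork\n");
        }
        if ($pid === 0) {
            // Child: handle its own slice, then exit so it does not
            // fall through into the parent's loop.
            foreach ($chunk as $word) {
                call_user_func($worker, trim($word));
            }
            exit(0);
        }
        $pids[] = $pid; // Parent: remember the child and keep forking.
    }
    // Parent: wait until every child has finished.
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
    return count($pids); // number of child processes that were started
}

// Usage (not executed here):
// runInParallel(file('testfile.txt'), 8, function ($word) { /* fetch and store */ });
```

Each child should open its own database connection inside the worker; a connection opened before the fork would be shared by all processes and is likely to misbehave.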
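Answer 2 (score: 0)
As FinalForm said, the complexity of your algorithm is too high (O(n^2)). You should avoid loops inside loops (and even more so a third loop inside those). You should always work out the complexity of your algorithm, although that is hard to do mathematically.
To optimize the slow parts, go after your low-hanging fruit first, using tools like xdebug / callgrind to find it. I suggest you watch the talk "simple is hard" by Rasmus, the creator of PHP, to learn this concept. Fixing the low-hanging fruit is where you gain the most.
I think the really slow part is that you fetch each page with a blocking call (nothing else can happen while it waits). I don't think the other loops take nearly as much time as fetching from a remote host, so that is your low-hanging fruit. You can use curl_multi to fetch your URLs in parallel, see http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/. This should be much faster than the blocking file_get_contents.
It may not be available (or not welcome) on shared hosting, though. To make your site really fast, you should process the load offline with a message queue (for example redis or beanstalkd): handle each individual task offline through the queue and send the results back in pieces.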
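The curl_multi approach mentioned above could be sketched roughly as follows, assuming the cURL extension is installed. fetchAll() is a hypothetical helper name and the 30-second timeout is an arbitrary choice; it returns the response bodies under the same keys as the input URLs.

```php
<?php
// Fetch several URLs concurrently with curl_multi instead of one blocking
// file_get_contents() call per word. fetchAll() is a hypothetical helper.
function fetchAll(array $urls)
{
    $multi = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // arbitrary per-request timeout
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none of them is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select can fail on some systems; avoid a busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the bodies and release the handles.
    $results = array();
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}

// Usage (not executed here): fetch in small batches rather than all at once.
// foreach (array_chunk($urls, 20) as $batch) { $pages = fetchAll($batch); }
```

With 307k words it would be unwise to add every URL to one multi handle; batching (the size of 20 above is only a guess) keeps memory use bounded and is kinder to the remote server.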
Answer 3 (score: 0)
If you use the latest NaturePHP library for PHP 5.3+ and have cURL installed, this should give you a huge boost:
<?php
include('nphp/init.php');

function pegaPalavras()
{
    // FILE_IGNORE_NEW_LINES strips the trailing newline that file() otherwise keeps on each word
    return file('testfile.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
}

// Fetch words from the site: www.dicsin.com.br
pegarSinonimos("http://www.dicsin.com.br/content/dicsin_lista.php");

function pegarSinonimos($url)
{
    $dictionary = pegaPalavras();
    $files = array();
    $con = mysql_connect("localhost","root","blablabla"); //omg! a root pwd :O
    if (!$con)
    {
        die('Could not connect: ' . mysql_error());
    }
    mysql_select_db("palavras2", $con);
    foreach($dictionary as $palavra)
    {
        $files[] = $url . "?f_pesq=" . $palavra;// . "&pagina=" . $pagina;
    }
    //Http::multi_getcontents on NaturePHP makes use of curl_multi for parallel processing and
    //fires callbacks asap, first come first served
    //it will, however, use a lot of CPU and bandwidth while processing
    Http::multi_getcontents(
        //uris
        $files,
        //process callback ($con must be imported with "use" to be visible inside the closure)
        function($url, $content) use ($con) {
            //recover the word from the request url, not from the response body
            list(, $word) = explode('=', $url, 2);
            $dom = new DOMDocument();
            $dom->loadHTML($content);
            $xpath = new DOMXPath($dom);
            $tags = $xpath->query('//div[@class="palavras_encontradas"]/div[@class="box_palavras_encontradas"]');
            //you create one table per word
            $sql = "CREATE TABLE `$word` (sinonimo varchar(29))";
            mysql_query($sql, $con);
            //if something was found (a DOMNodeList exposes its size as ->length)
            if ($tags->length > 0) {
                //get an array with synonyms
                $synonyms = array();
                foreach ($tags as $tag)
                {
                    $synonyms[] = utf8_decode($tag->nodeValue);
                }
                //you can use str_replace on arrays, it's faster
                $synonyms = str_replace("visualizar palavras", "", $synonyms);
                $synonyms = str_replace("(Sinônimo) ", "", $synonyms);
                //escape the values before putting them into the query
                $synonyms = array_map('mysql_real_escape_string', $synonyms);
                //a single insert query with all values is much faster
                $values = "('" . implode("'), ('", $synonyms) . "')";
                mysql_query("INSERT INTO `$word` (sinonimo) VALUES $values", $con);
            }
        }
    );
    mysql_close($con);
}
?>
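I haven't really tested this code, so there may be minor errors, but you get the general idea ;)
If you don't have PHP 5.3+, you can look at the NaturePHP source code to see how to use curl_multi.
PS: you may want to change your root pwd :x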