Question

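I have two tables. Table 1 has about 400K rows, each holding a paragraph of text of up to 50 sentences. Table 2 is a lexicon of 80K words, with the score I need in order to code each word of each paragraph.

The point of my PHP script is to explode each paragraph into its words, look up each word's score in the lexicon, and then compute the total score of all the words in each row.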

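My strategy so far has been to write a script that does the following:

1. Connect to the database, table 1.
2. While loop, one row after another.
3. For the current row, explode the paragraph.
4. For each word, look in table 2 and return the score if the word exists.
5. End with a total score for the current row.
6. Update table 1 with the total score for the current paragraph.
7. Go back to point 2.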

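My code works, but it is not efficient. The script is so slow that after running for an hour it has processed only the first 500 rows. That is a problem because I have 400K rows, and I will need this script for other projects.

What would you advise me to do to make this process faster?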

<?php

//Include functions
        include "functions.php";
        ini_set('max_execution_time', 9000000);
        echo 'Time Limit = ' . ini_get('max_execution_time');
        $db='senate';   
//Function to search into the array lexicon     
        function searchForId($id, $array) {
        foreach ($array as $key2 => $val) {
        if ($val['word'] === $id) {
        return $key2;
       } 
   }
   return null;
}       

// tags to remove
        $remove   = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');       
        $x=1;
//Connecting to the database
        if (!$conn) {
        die('Not connected : ' . mysql_error());}


// Choose the current db
        mysql_select_db($db);

//Slurps the lexicon into an array
$sql = "SELECT word, score FROM concreteness";
$resultconcreteness = mysql_query($sql) or die(mysql_error());
$array = array();
while($row = mysql_fetch_assoc($resultconcreteness)) {
$array[] = $row;
}

//loop      
        while($x<=500000) {
        $data = mysql_query("SELECT `key`, `tagged` FROM speechesLCMcoded WHERE `key`='$x'") or die(mysql_error());

// puts the "data" info into the $info array 
        $info = mysql_fetch_array( $data);
        $tagged=$info['tagged'];
        unset($weight);
        unset($count);
        $weight=0;
        $count=0;

// Print out the contents of the entry 
        Print "<b>Key:</b> ".$info['key'] .  " <br>";

// Explodes the sentence
        $speech = explode(" ", $tagged);

// Loop every word  
        foreach($speech as $word) {

//Check if string contains our tag

if(!preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {} else{

//Removes our tags
        $word = str_replace($remove, "", $word);

        $id = searchForId($word, $array);
//      print "ID: " . $id . "<br>";
//      print "Word: " . $array[$id]['word'] . "<br>";
//      print "Score: " . $array[$id]['score'] . "<br>";
        $weight=$weight+$array[$id]['score'];
        $count=$count +1;
//      print "Weight: " . $weight . "<br>";
 //     print "Count: " . $count . "<br>";
        }
}
        $sql = "UPDATE speechesLCMcoded SET weight='$weight', count='$count' WHERE `key`='$x';" ;
        $retval = mysql_query( $sql, $conn );
        if(! $retval )
        {die('Could not update data: ' . mysql_error());}
        echo "Updated data successfully\n";
        ob_flush();
        flush();   

        //Increase the loop by one
        $x=$x+1;

}?>

Here is the table definition, including its index:

CREATE TABLE `speechesLCMcoded` (
 `key` int(11) NOT NULL AUTO_INCREMENT,
 `speaker_state` varchar(100) NOT NULL,
 `speaker_first` varchar(100) NOT NULL,
 `congress` varchar(100) NOT NULL,
 `title` varchar(100) NOT NULL,
 `origin_url` varchar(100) NOT NULL,
 `number` varchar(100) NOT NULL,
 `order` varchar(100) NOT NULL,
 `volume` varchar(100) NOT NULL,
 `chamber` varchar(100) NOT NULL,
 `session` varchar(100) NOT NULL,
 `id` varchar(100) NOT NULL,
 `raw` mediumtext NOT NULL,
 `capitolwords_url` varchar(100) NOT NULL,
 `speaker_party` varchar(100) NOT NULL,
 `date` varchar(100) NOT NULL,
 `bills` varchar(100) NOT NULL,
 `bioguide_id` varchar(100) NOT NULL,
 `pages` varchar(100) NOT NULL,
 `speaker_last` varchar(100) NOT NULL,
 `speaker_raw` varchar(100) NOT NULL,
 `tagged` mediumtext NOT NULL,
 `adjectives` varchar(10) NOT NULL,
 `verbs` varchar(10) NOT NULL,
 `nouns` varchar(10) NOT NULL,
 `weight` varchar(50) NOT NULL,
 `count` varchar(50) NOT NULL,
 PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=408344 DEFAULT CHARSET=latin1

Answer 1

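You have a fairly small reference table (your lexicon) and an enormous corpus of text (table 1).

If I were you, I would start your program by slurping the entire lexicon from the table into a PHP array in memory. Even if all your words are 20 characters long, this will only take a dozen or so megabytes of RAM.

Then do your step 4 by looking up the words in memory rather than with a SQL query. Your inner loop (for each word) will be much faster, and just as accurate.

Be careful about one thing, though. If you want to replicate MySQL's case-insensitive lookup behavior, you need to normalize the words in your lexicon by converting them to lowercase.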
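If you want to check the memory estimate on your own machine, a rough sketch like the following may help. It does not touch your database: the 80,000 generated 20-character words and the score of 2.5 are stand-ins I made up, and the figure it prints will vary with your PHP version.

```php
<?php
// Hypothetical stand-in for the real lexicon: 80,000 distinct 20-character words.
$before = memory_get_usage();

$lookup = array();
for ($i = 0; $i < 80000; $i++) {
    // 'word' plus a 16-digit zero-padded counter gives a 20-character key.
    $word = 'word' . str_pad((string)$i, 16, '0', STR_PAD_LEFT);
    $lookup[$word] = 2.5;   // placeholder score, not a real lexicon value
}

$after = memory_get_usage();

// Report how much memory the in-memory lexicon took in this process.
printf("%d entries use about %.1f MB\n", count($lookup), ($after - $before) / 1048576);
```

If the number it reports is comfortably below your memory_limit, loading the whole lexicon up front is safe.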

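Edit after seeing your code

Some pro tips:

Indent your code correctly so you can see the structure of your loops at a glance.
Remember that passing data to functions takes time.
PHP arrays are associative. You can do $value = $array[$key]. This is fast. You don't have to search your array linearly. You are doing that for every word!!
Prepared statements are good.
Repeating a SQL statement when you could just read the next row from its result set is bad.
Streaming result sets are good.
The mysql_ function calls are deprecated and despised by their developers, and by everybody else, for good reasons. They were removed entirely in PHP 7.0.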
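To make the associative-array tip concrete, here is a small sketch that puts the question's linear search next to a keyed lookup. The three lexicon rows and their scores are made up for illustration; only searchForId comes from the question.

```php
<?php
// The question's approach: walk a list of rows until the word is found.
function searchForId($id, $array) {
    foreach ($array as $key => $val) {
        if ($val['word'] === $id) {
            return $key;
        }
    }
    return null;
}

// Made-up lexicon rows, shaped like what mysql_fetch_assoc() returns.
$rows = array(
    array('word' => 'apple',   'score' => '5.0'),
    array('word' => 'justice', 'score' => '1.5'),
    array('word' => 'table',   'score' => '4.9'),
);

// Re-key once, up front: lowercase word => score.
$lookup = array();
foreach ($rows as $row) {
    $lookup[strtolower($row['word'])] = (float)$row['score'];
}

$id     = searchForId('justice', $rows);
$linear = (float)$rows[$id]['score'];   // cost grows with the size of the lexicon
$keyed  = $lookup['justice'];           // one hash lookup, whatever the size

var_dump($linear === $keyed);
```

Both paths return the same score. With 80K lexicon entries the linear scan averages tens of thousands of comparisons per word, while the keyed lookup stays a single hash access.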

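There is far too much going on inside your loop.

What you need is this:

First, switch from the mysql_ interface to the mysqli_ interface. Just do it. mysql_ is too slow, too old, and too crufty.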

$db = new mysqli("host", "user", "password", "database");

Second, change the way you load your lexicon so that it takes full advantage of associative arrays.

$lookup = array();
//Slurps the lexicon into an array, streaming it row by row
$sql = "SELECT word, score FROM concreteness";
$db->real_query($sql) or die($db->error);
$lkup = $db->use_result();
while ($row = $lkup->fetch_row()) {
      $lookup[strtolower($row[0])] = $row[1];
}
$lkup->close();

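This gives you an associative array called $lookup. If you have a $word, you can find its weight value this way. This is fast; what you have in your sample code is very slow. Notice that the keys are converted to lowercase both when they are created and when words are looked up. For performance reasons, don't put this in a function if you can avoid it.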

$k = strtolower($word);                /* normalize once, reuse below */
if (array_key_exists( $k, $lookup )) {
    $weight += $lookup[$k];            /* accumulate weight */
    $count ++;                         /* increment count   */
}
else {
  /* the word was not found in your lexicon. handle as needed */
}

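Finally, you need to optimize how you query the rows of your text corpus and how you update them. I believe you should use prepared statements for that.

Here is how that goes.

Place this code near the beginning of your program.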

$previouskey = -1;   /* lower than any real key, so processing starts at the first row */
/* to resume an interrupted run, set $previouskey to the key of
   the last successfully processed row instead */

$get_stmt = $db->prepare('SELECT `key`, `tagged` 
                           FROM speechesLCMcoded 
                          WHERE `key` > ?
                          ORDER BY `key` LIMIT 1' );

$post_stmt = $db->prepare ('UPDATE speechesLCMcoded 
                               SET weight=?, 
                                   count=? 
                             WHERE `key`=?' );

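These give you two ready-to-use statements for your processing.

Notice that $get_stmt retrieves the first key you have not yet processed. This works even if some keys are missing, which is always good. Because there is an index on your key column, it will be reasonably efficient.

So here is what your loop ends up looking like: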

 $weight = 0;
 $count = 0;
 $key = 0;
 $tagged = '';

 /* bind parameters and results to the get statement */
 $get_stmt->bind_result($key, $tagged);
 $get_stmt->bind_param('i', $previouskey);

 /* bind parameters to the post statement
    (use 'dii' instead of 'iii' if your scores are fractional) */
 $post_stmt->bind_param('iii',$weight, $count, $key);

 $done = false;
 while ( !$done ) {
    $get_stmt->execute();
    /* buffer the one-row result so the connection is free for the UPDATE */
    $get_stmt->store_result();
    if ($get_stmt->fetch()) {

        /* reset $weight and $count to 0, then do everything
           word - by - word  here on the $tagged string */

        /* do the post statement to store the results */
        $post_stmt->execute();

        /* update the previous key prior to next iteration */
        $previouskey = $key; 
        $get_stmt->reset();
        $post_stmt->reset();
    } /* end if fetch */
    else {
       /* no result returned! we are done! */
       $done = true;
    }
 } /* end while not done */

This should get you down to sub-second processing per row.
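For the "word - by - word" placeholder in the loop, something along these lines might do. It is only a sketch: score_tagged is a hypothetical helper (wrapped in a function here purely so it can be tried in isolation; inline it in the real loop if call overhead matters), the tag list is taken from the question, and the two lexicon scores are made up. Like the lookup snippet earlier in this answer, it counts only words found in the lexicon, whereas the question's code counts every tagged word.

```php
<?php
// Hypothetical helper: returns array(total weight, number of scored words).
function score_tagged($tagged, array $lookup) {
    // Tags from the question's $remove list.
    $remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');
    $weight = 0.0;
    $count  = 0;
    foreach (explode(' ', $tagged) as $token) {
        // Only tokens carrying an opening tag are candidates for scoring.
        if (!preg_match('/\{(V|J|N|RB)\}/', $token)) {
            continue;
        }
        // Strip the tags and normalize case to match the lexicon keys.
        $word = strtolower(str_replace($remove, '', $token));
        if (isset($lookup[$word])) {
            $weight += $lookup[$word];
            $count++;
        }
    }
    return array($weight, $count);
}

$lookup = array('dog' => 4.5, 'run' => 3.0);   // made-up scores
list($weight, $count) = score_tagged('The {N}Dog{/N} {V}run{/V} {RB}fast{/RB} home', $lookup);

printf("weight=%s count=%d\n", $weight, $count);
```

In this sample, "Dog" and "run" are scored, "fast" is tagged but absent from the lexicon, and the untagged words are skipped.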

Answer 2

First, the obvious optimization looks like this:

include "functions.php";
set_time_limit(0); // NOTE: no time limit
if (!$conn)
    die('Not connected : ' . mysql_error());
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}'); // tags to remove       
$db = 'senate';
mysql_select_db($db);

$resultconcreteness = mysql_query('SELECT `word`, `score` FROM `concreteness`') or die(mysql_error());
$array = array(); // NOTE: init score cache
while($row = mysql_fetch_assoc($resultconcreteness))
    $array[strtolower($row['word'])] = $row['score']; // NOTE: php array as hashmap
mysql_free_result($resultconcreteness);

$data = mysql_query('SELECT `key`, `tagged` FROM `speechesLCMcoded`') or die(mysql_error()); // NOTE: single query instead of multiple
while ($row = mysql_fetch_assoc($data)) {
    $key = $row['key'];
    $tagged = $row['tagged'];
    $weight = $count = 0;
    $speech = explode(' ', $tagged);
    foreach ($speech as $word) {
        if (preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {
            $w = strtolower(str_replace($remove, '', $word));
            if (isset($array[$w]))
                $weight += $array[$w]; // NOTE: quick access to word's score; words missing from the lexicon add nothing
            $count++;
        }
    }
    mysql_query('UPDATE `speechesLCMcoded` SET `weight`='.$weight.', `count`='.$count.' WHERE `key`='.$key, $conn) or die(mysql_error());
}
mysql_free_result($data);

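See the comments marked NOTE.

With 400K rows it will still take some time, if only because you have to update every row, which means 400K updates.

Possible future optimizations:

Have this script take arguments such as a starting offset and a length (pass them to MySQL's LIMIT), so you can run several scripts at once, each processing a different chunk of the table.
Instead of running updates, generate a file with the data and then use LOAD DATA INFILE to replace your table; that may be faster than 400K updates.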
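As a rough illustration of the file-based idea, the sketch below writes results to a tab-separated file, which is the default field format LOAD DATA expects. The $results values are made up, speech_scores is a hypothetical staging table, and the SQL in the comment is an assumed follow-up that you would need to adapt (LOAD DATA LOCAL also has to be enabled on both client and server).

```php
<?php
// Hypothetical results keyed by speechesLCMcoded.key: array(weight, count).
$results = array(
    1 => array(12.5, 4),
    2 => array(0.0, 0),
    5 => array(7.25, 3),
);

// Write one tab-separated line per row: key, weight, count.
$path = tempnam(sys_get_temp_dir(), 'scores');
$fh = fopen($path, 'w');
foreach ($results as $key => $r) {
    fwrite($fh, $key . "\t" . $r[0] . "\t" . $r[1] . "\n");
}
fclose($fh);

// Assumed follow-up on the MySQL side (speech_scores is a made-up staging table):
//   LOAD DATA LOCAL INFILE '<path>' INTO TABLE speech_scores (`key`, weight, `count`);
//   UPDATE speechesLCMcoded s JOIN speech_scores t ON t.`key` = s.`key`
//      SET s.weight = t.weight, s.`count` = t.`count`;

// Read the file back to confirm what was written.
$lines = file($path, FILE_IGNORE_NEW_LINES);
echo count($lines) . " lines written to $path\n";
```

Loading into a staging table and applying one joined UPDATE replaces 400K separate statements with two, which is where the speedup would come from.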
