I need to get sentences that meet certain criteria from some text files and then store them in a database

Time: 2012-06-27 02:57:17

Tags: php text

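My text consists of several sentences. I have to split it into sentences at each period and count the words in each sentence. Sentences containing more than 5 words will be inserted into the database. Here is my code: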

<?php

require_once 'conf/conf.php';// connect to database

function saveContent ($text) {
  //I have to get every sentence without lose the dot
  $text1 = str_replace('.', ".dot", $text);
  $text2 = explode ('dot',$text1); 

  //Text that contain ' cannot be inserted to database, so i need to remove it 
  $text3 = str_replace("'", "", $text2); 

  //Select only the sentences that consist of more than 5 words
  for ($i=0;$i<count($text3);$i++){
    if(count(explode(" ", $text3[$i]))>5){
      $save = $text3[$i];

      $q0 = mysql_query("INSERT INTO tbdocument VALUES('','$files','".$save."','','','') ");
    }
  }
}

$text= "I have some text files in my folder. I get them from extraction process of pdf journals files into txt files. here's my code";
$a = saveContent($text);

?>

As a result, only one sentence (the first one) gets inserted into the database. I need your help, thank you very much :).

1 answer:

Answer 0: (score: 0)

There are many ways to improve this (and to make it work).

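Rather than replacing . with .dot, you can simply explode on . and remember to add it back later. However, what if your sentence is something like Mr. Smith went to Washington.? You cannot reliably tell those periods apart.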
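As an alternative that keeps the period attached, here is a minimal sketch using preg_split() with a lookbehind. The helper name splitSentences() is made up for illustration and is not part of your code. It also shows the abbreviation problem mentioned above: "Mr." is treated as the end of a sentence.

```php
<?php
// Hypothetical helper, for illustration only.
function splitSentences(string $text): array
{
    // Split on whitespace that directly follows a period.
    // The lookbehind (?<=\.) matches without consuming the period, so it stays on the sentence.
    return preg_split('/(?<=\.)\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Yields three pieces: "I have some text files.", "Mr." and "Smith went to Washington."
print_r(splitSentences("I have some text files. Mr. Smith went to Washington."));
```

If abbreviations are common in your journals, a simple split like this will not be enough and you would need a list of known abbreviations or a dedicated sentence tokenizer.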

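The variable $files in your INSERT is not defined within the scope of this function. We don't know where it comes from or what you expect it to contain, but here it will be NULL.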
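To illustrate the scope issue, here is a small sketch. The file name and both function names are invented for the example. A variable set outside a function is not visible inside it unless you pass it in as a parameter (or declare it global).

```php
<?php
$files = 'journal1.txt'; // set in the global scope (hypothetical file name)

function withoutParameter(): ?string
{
    // Functions do not see global variables, so $files is unset here.
    // The ?? operator turns the unset variable into an explicit null.
    return $files ?? null;
}

function withParameter(string $files): string
{
    // Here $files is the value the caller passed in.
    return $files;
}

var_dump(withoutParameter());    // NULL
var_dump(withParameter($files)); // string(12) "journal1.txt"
```

Passing the value as an argument, e.g. saveContent($text, $files), is usually the cleanest fix.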

function saveContent ($text) {
  // Just explode on the . and replace it later...
  $sentences = explode(".", $text);

  // Don't remove single quotes. They'll be properly escaped later...

  // Rather than an incremental loop, use a proper foreach loop:
  foreach ($sentences as $sentence) {
    // Using preg_split() instead of explode() in case there are multiple spaces in sequence.
    // PREG_SPLIT_NO_EMPTY skips the empty piece caused by the leading space explode(".") leaves behind.
    if (count(preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY)) > 5) {
      // Escape and insert
      // And add the . back onto it
      $save = mysql_real_escape_string(trim($sentence)) . ".";

      // $files is not defined in scope of this function!
      // Insert the escaped $save, not the raw $sentence.
      $q = mysql_query("INSERT INTO tbdocument VALUES('', '$files', '$save', '', '', '')");
      // Don't forget to check for errors.
      if (!$q) {
        echo mysql_error();
      }
    }
  }
}
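The word count is sensitive to stray whitespace, so here is a short sketch comparing explode() with preg_split(). The helper name countWords() is hypothetical. Note that every sentence after the first starts with a space once you explode on the period.

```php
<?php
// Hypothetical helper, for illustration only.
function countWords(string $sentence): int
{
    // Split on runs of whitespace and drop empty pieces.
    return count(preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY));
}

$sentence = " here's  my code"; // leading space and a double space

echo count(explode(' ', $sentence)), "\n"; // 5, because the empty pieces are counted too
echo countWords($sentence), "\n";          // 3
```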

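In the long run, consider moving away from the mysql_*() functions and start learning an API that supports prepared statements, such as PDO or MySQLi. The old mysql_*() functions will soon be deprecated and lack the security that prepared statements provide.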
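For reference, a rough sketch of what this could look like with PDO prepared statements. The column names filename and sentence are assumptions, since the actual layout of tbdocument is not shown, and the function name saveSentences() is made up for the example.

```php
<?php
// Sketch only: the columns `filename` and `sentence` are assumed, not taken from the question.
function saveSentences(PDO $pdo, string $files, string $text): int
{
    // Prepare once, execute many times with different values.
    $stmt = $pdo->prepare(
        'INSERT INTO tbdocument (filename, sentence) VALUES (:filename, :sentence)'
    );

    $saved = 0;
    foreach (explode('.', $text) as $sentence) {
        $sentence = trim($sentence);
        $words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

        if (count($words) > 5) {
            // No manual escaping and no need to strip single quotes:
            // the values are bound as parameters, not pasted into the SQL.
            $stmt->execute([':filename' => $files, ':sentence' => $sentence . '.']);
            $saved++;
        }
    }

    // Number of sentences that were inserted.
    return $saved;
}
```

You would call it with a connection you create yourself, for example new PDO('mysql:host=localhost;dbname=yourdb;charset=utf8', $user, $pass), where the host, database name and credentials are placeholders. Setting PDO::ATTR_ERRMODE to PDO::ERRMODE_EXCEPTION on that connection makes failed queries throw instead of failing silently.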