PHP: Custom parser for unformatted text

Time: 2018-01-03 14:54:25

Tags: php arrays regex parsing

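I'm trying to build a project that helps students study various subjects. I have a chunk of raw text containing quiz questions and answers, and I want to parse it into question headers and answer options so that they can be inserted into a database. However, the text is not consistently formatted, and because of the sheer volume (roughly 20k questions and 20k answers), I don't have time to insert them by hand or to clean up the formatting myself.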

The raw text looks like this:

1. A car averages 27 miles per gallon. If gas costs $4.04 per gallon, which of the following is closest to how much the gas would cost for this car to travel 2,727 typical miles?

a) $44.44 b) $109.08 c) $118.80
d) $408.04 e) 
$444.40

2. When x = 3 and y = 5, by how much does the value of 3x2 – 2y exceed the value of 2x2 – 3y ?
a) 4
b) 14
c) 16
d) 20 e) 50

I tried writing my own PHP function to parse the text properly, but I can't get past the random line breaks, whitespace, and so on.

What I want to get:

array(1) {
  [0]=>
  array(3) {
    ["questionNumber"]=>
    string(1) "1"
    ["questionText"]=>
    string(175) "A car averages 27 miles per gallon. If gas costs $4.04 per gallon, which of the following is closest to how much the gas would cost for this car to travel 2,727 typical miles?"
    ["options"]=>
    array(5) {
      ["a"]=>
      string(6) "$44.44"
      ["b"]=>
      string(7) "$109.08"
      ["c"]=>
      string(7) "$118.80"
      ["d"]=>
      string(7) "$408.04"
      ["e"]=>
      string(7) "$444.40"
    }
  }
}

My code so far:

$rawText = '1. A car averages 27 miles per gallon. If gas costs $4.04 per gallon, which of the following is closest to how much the gas would cost for this car to travel 2,727 typical miles?

a) $44.44 b) $109.08 c) $118.80
d) $408.04 e)
$444.40

2. When x = 3 and y = 5, by how much does the value of 3x2 – 2y exceed the value of 2x2 – 3y ?
a) 4
b) 14
c) 16
d) 20 e) 50
';

$rawTextLines = explode("\n", $rawText);
foreach ($rawTextLines as $lineNumber => $lineContents) {
    $lContents = trim($lineContents);

    if (empty ($lContents)) {
        unset ($rawTextLines[$lineNumber]);
    } else {
        $rawTextLines[$lineNumber] = $lContents;
    }
}

$processedQuestions    = array ();
$currentQuestionHeader = 0;
foreach ($rawTextLines as $lineNumber => $lineContents) {
    if (ctype_digit(substr($lineContents, 0, 1))) { // Question header
        $questionHeaderInformation = explode('.', $lineContents);
        $currentQuestionHeader     = $questionHeaderInformation[0];

        $processedQuestions[$currentQuestionHeader]['questionNumber'] = $currentQuestionHeader;
        $processedQuestions[$currentQuestionHeader]['questionText']   = $questionHeaderInformation[1];
    } else { // Question option
        $options = explode(')', $lineContents);

        if (count ($options) % 2 === 0) {
            $processedQuestions[$currentQuestionHeader]['options'][trim($options[0])] = ucfirst(trim($options[1]));
        } else {

        }
    }
}

This produces:

array(2) {
  [1]=>
  array(3) {
    ["questionNumber"]=>
    string(1) "1"
    ["questionText"]=>
    string(35) " A car averages 27 miles per gallon"
    ["options"]=>
    array(1) {
      ["a"]=>
      string(8) "$44.44 b"
    }
  }
  [2]=>
  array(3) {
    ["questionNumber"]=>
    string(1) "2"
    ["questionText"]=>
    string(96) " When x = 3 and y = 5, by how much does the value of 3x2 – 2y exceed the value of 2x2 – 3y ?"
    ["options"]=>
    array(3) {
      ["a"]=>
      string(1) "4"
      ["b"]=>
      string(2) "14"
      ["c"]=>
      string(2) "16"
    }
  }
}

As you can see, the current output is nowhere near what I want.

Thanks in advance.

1 answer:

Answer 0: (score: 0)

Hello,

^[0-9]+\. (.*)[\r\n]+a\)[\s]+(.*)[\s]+b\)[\s]+(.*)[\s]+c\)[\s]+(.*)[\s]+d\)[\s]+(.*)[\s]+e\)[\s]+(.*)[\s]*

Try it:

$re = '/^[0-9]+\. (.*)[\r\n]+a\)[\s]+(.*)[\s]+b\)[\s]+(.*)[\s]+c\)[\s]+(.*)[\s]+d\)[\s]+(.*)[\s]+e\)[\s]+(.*)[\s]*/m';
$str = '1. A car averages 27 miles per gallon. If gas costs $4.04 per gallon, which of the following is closest to how much the gas would cost for this car to travel 2,727 typical miles?

a) $44.44 b) $109.08 c) $118.80
d) $408.04 e) 
$444.40

2. When x = 3 and y = 5, by how much does the value of 3x2 – 2y exceed the value of 2x2 – 3y ?
a) 4
b) 14
c) 16
d) 20 e) 50';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);
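
The pattern above does not capture the question number, and `preg_match_all` returns flat match arrays rather than the nested structure the question asks for. Here is a minimal sketch of how the matches might be reshaped. It assumes every question has exactly five options labelled a) to e). The extra capture group around the number and the `$questions` variable are additions for illustration and are not part of the original answer.

```php
<?php
// Same pattern as above, with the question number captured as group 1.
$re = '/^([0-9]+)\. (.*)[\r\n]+a\)\s+(.*)\s+b\)\s+(.*)\s+c\)\s+(.*)\s+d\)\s+(.*)\s+e\)\s+(.*)\s*/m';

$str = '1. A car averages 27 miles per gallon. If gas costs $4.04 per gallon, which of the following is closest to how much the gas would cost for this car to travel 2,727 typical miles?

a) $44.44 b) $109.08 c) $118.80
d) $408.04 e)
$444.40

2. When x = 3 and y = 5, by how much does the value of 3x2 – 2y exceed the value of 2x2 – 3y ?
a) 4
b) 14
c) 16
d) 20 e) 50';

preg_match_all($re, $str, $matches, PREG_SET_ORDER);

// Reshape each match into the structure described in the question.
$questions = array();
foreach ($matches as $m) {
    $questions[] = array(
        'questionNumber' => $m[1],
        'questionText'   => trim($m[2]), // trim() also drops a stray "\r" from CRLF input
        'options'        => array(
            'a' => trim($m[3]),
            'b' => trim($m[4]),
            'c' => trim($m[5]),
            'd' => trim($m[6]),
            'e' => trim($m[7]),
        ),
    );
}

var_dump($questions);
```

A question with fewer or more than five options will most likely not match at all, so comparing `count($questions)` against the expected number of questions is a cheap sanity check before inserting anything into the database.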