Splitting a file's contents on blank lines?

Time: 2017-04-24 03:11:45

Tags: php regex nlp

I have a file with the following contents:

( (CODE <begin_A_defense_of_Michael_Moore>))

( (NP (NP (NP (DT A) (NN defense))
      (PP (IN of)
          (NP (NP (NNP Michael) (NNP Moore))
          (CC and)
          (" ")
          (S-NOM-TTL (NP-SBJ (-NONE- *PRO*))
                 (VP (VBG Bowling)
                 (PP-PRP (IN for)
                     (NP (NNP Columbine))))))))
      (" ")
      (CODE -LRB-)
      (PRN (NP (NN Op-Ed)))
      (CODE -RRB-)
      (PP (IN By)
      (NP (NNP Eloquence)))))

( (FRAG (NP (NNP Wed))
    (NP (NML (NNP Aug))
        (JJ 13th)
        (, ,)
        (NN 2003))
    (PP-TMP (IN at)
        (NP (CD 09:00:09)
            (FW AM) (FW EST)))))

( (S (NP-SBJ (DT This))
     (VP (VBZ is)
     (NP-PRD (NP (DT an) (JJ open) (NN letter))
         (PP (IN to)
             (NP (NP (NNP David) (NNP Hardy))
             (, ,)
             (NP (NP (NN author))
                 (PP (IN of)
                 (NP (NP-TTL (S-NOM-TTL (NP-SBJ (-NONE- *PRO*))
                            (VP (VB Bowling)
                                (PP-PRP (IN for)
                                    (NP (NNP Columbine)))))
                         (: :)
                         (NP (NN Documentary) (CC or) (NN Fiction)))
                     (, ?)
                     (, ,)
                     (RRC (ADVP (RB probably))
                      (NP-PRD (NP (DT the)
                              (ADJP (RBS most) (JJ comprehensive)))
                          (PP (IN among)
                              (NP (NP (JJ many) (NNS rebuttals))
                              (PP (IN of)
                                  (NP (DT the)
                                  (ADJP (NNP Oscar) (HYPH -) (VBG winning))
                                  (NN documentary))))))))))))))
     (. .)))

( (S (NP-SBJ (NNS Critics))
     (VP (VBP have)
     (ADVP-TMP (RB now))
     (VP (VBN gone)
         (ADVP (ADVP (RB so) (RB far))
           (SBAR (IN as)
             (S (NP-SBJ (-NONE- *PRO*))
                (VP (TO to)
                (VP (VB call)
                    (PP-CLR (IN for)
                        (NP (NP (DT the) (NN revocation))
                        (PP (IN of)
                            (NP (DT the) (NN award))))))))))))
     (. .)))

( (S (NP-SBJ (PRP$ Their) (NNS chances))
     (VP (VBP are)
     (ADJP-PRD (JJ small))
     (, ,)
     (ADVP (RB however))
     (, ,)
     (SBAR-PRP (IN as)
           (S (NP-SBJ (PRP$ their) (NNS arguments))
              (VP (VP (VBP rely)
                  (PP-CLR=1 (IN on)
                    (NP (NN polemic) (, ,) (NN exaggeration) (CC and) (NN misrepresentation))))
              (: --)
              (VP (PP (IN in)
                  (NP (JJ other) (NNS words)))
                  (, ,)
                  (PP-CLR=1 (IN on)
                    (NP (NP (DT the) (JJ same) (NNS techniques))
                        (SBAR (WHNP-2 (WP which))
                          (S (NP-SBJ (PRP they))
                             (VP (VBP accuse)
                             (NP (NNP Moore))
                             (PP-CLR (IN of)
                                 (S-NOM (NP-SBJ (-NONE- *PRO*))
                                    (VP (VBG using)
                                        (NP (-NONE- *T*-2)))))))))))))))
     (. .)))

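I need to process each parse separately. I think the best approach is to split the file on blank lines (or is there a better way?). Does anyone know how to do this? I am using PHP. The file comes from the MASC corpus.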

Thanks.

1 answer:

Answer 0: (score: 0)

I actually ended up doing it this way:

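$newfile = file("textfile.txt");
$temp_str = '';
$parses = array();
foreach ($newfile as $line) {
    $temp = trim($line);
    if (strlen($temp) > 0) {
        // Still inside a parse: append this line to the current block.
        $temp_str .= $temp;
    } elseif ($temp_str !== '') {
        // A blank line ends the current parse; runs of blank lines are skipped.
        array_push($parses, $temp_str);
        $temp_str = '';
    }
}
// Store the last parse when the file does not end with a blank line.
if ($temp_str !== '') {
    array_push($parses, $temp_str);
}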
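
Since the question is tagged regex, the same split can probably also be done in one pass with preg_split. The following is only a minimal sketch, not part of the original answer: the function name split_parses and the sample string are made up for illustration, and it assumes that parses are separated by one or more blank lines (lines that are empty or contain only spaces or tabs).

```php
<?php
// Hypothetical helper (not from the original post): split treebank text
// into one string per parse, treating blank lines as separators.
function split_parses(string $text): array
{
    // \R matches any newline sequence; a line holding only spaces or tabs
    // also counts as blank. PREG_SPLIT_NO_EMPTY drops empty pieces.
    $blocks = preg_split('/\R(?:[ \t]*\R)+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $parses = [];
    foreach ($blocks as $block) {
        // Collapse indentation and line breaks inside a parse to single spaces.
        $parses[] = preg_replace('/\s+/', ' ', trim($block));
    }
    return $parses;
}

// Small made-up sample in the same bracketed format as the question.
$sample = "( (CODE <begin>))\n\n( (NP (DT A)\n      (NN defense)))\n";
print_r(split_parses($sample));
```

For the real file, something like split_parses(file_get_contents("textfile.txt")) should work, with the file name taken from the answer above. Unlike the line-by-line loop, this version joins the lines of each parse with a single space rather than concatenating them directly.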