How to remove <tags> using regular expressions in PHP

Time: 2016-05-12 01:55:22

Tags: php regex tags

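I need to remove the div and td tags so that I can extract the content between them and insert it into a database. However, due to some constraints, I have to use regular expressions rather than XPath or DOMDocument to extract the content. Any help is appreciated. Thanks!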

 <tr class = "student_information" >
            <div class="admin"><td>141234U</td></div>
            <div class="name"><td>Tan Ping Ping</td></div>
            <div class="hp"><td>82222222</td></div>
            <div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
        </tr>
                    <tr class = "student_information" >
            <div class="admin"><td>132458Q</td></div>
            <div class="name"><td>Tan Rui</td></div>
            <div class="hp"><td>86339557</td></div>
            <div class="email"><td>132458Q@hotmail.com</td></div>

 Output: 

 141234U
 Tan Ping Ping
 82222222
 141234U@mymail.nyp.edu.sg

 132458Q
 Tan Rui
 86339557
 132458Q@hotmail.com

1 answer:

Answer 0 (score: 0)

  

However, due to some constraints, I have to use regular expressions rather than XPath or DOMDocument to extract the content

Given that constraint, you can use this regex: (?<=>)([\w .@]+)(?=<), for example:

$str = <<< EOF
 <tr class = "student_information" >
            <div class="admin"><td>141234U</td></div>
            <div class="name"><td>Tan Ping Ping</td></div>
            <div class="hp"><td>82222222</td></div>
            <div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
        </tr>
                    <tr class = "student_information" >
            <div class="admin"><td>132458Q</td></div>
            <div class="name"><td>Tan Rui</td></div>
            <div class="hp"><td>86339557</td></div>
            <div class="email"><td>132458Q@hotmail.com</td></div>
EOF;

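// Capture each run of word characters, spaces, dots, and @ signs
// that sits directly between a ">" and a "<".
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $str, $result, PREG_PATTERN_ORDER);
foreach ($result[1] as $match) {
    echo $match . "\n";
}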

Output:

141234U
Tan Ping Ping
82222222
141234U@mymail.nyp.edu.sg
132458Q
Tan Rui
86339557
132458Q@hotmail.com
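
Since the question mentions inserting the values into a database, the flat list of matches can be regrouped into one record per student. The following is only a minimal sketch: it assumes every row contains exactly four values in the order admin, name, hp, email (the field names are taken from the div classes in the question), and it uses a single shortened row as sample input.

```php
<?php
// One row from the question's markup, used as sample input.
$str = '<tr class = "student_information" >
<div class="admin"><td>141234U</td></div>
<div class="name"><td>Tan Ping Ping</td></div>
<div class="hp"><td>82222222</td></div>
<div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
</tr>';

preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $str, $result);

// Assumption: four fields per student, always in this order.
$fields = ['admin', 'name', 'hp', 'email'];

$rows = [];
foreach (array_chunk($result[1], count($fields)) as $chunk) {
    // Skip an incomplete trailing group instead of failing in array_combine().
    if (count($chunk) === count($fields)) {
        $rows[] = array_combine($fields, $chunk);
    }
}

print_r($rows);
```

Each element of $rows is then an associative array that could be bound to a prepared INSERT statement. If a row is missing a value, the grouping shifts, so this only holds for markup as regular as the sample.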

Regex explanation:

(?<=>)([\w .@]+)(?=<)

Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=>)»
   Match the character “>” literally «>»
Match the regex below and capture its match into backreference number 1 «([\w .@]+)»
   Match a single character present in the list below «[\w .@]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A “word character” (letter, digit, or underscore; ASCII-only here because the pattern has no u modifier) «\w»
      A single character from the list “ .@” (space, dot, or at sign) « .@»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=<)»
   Match the character “<” literally «<»
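
One consequence of the character class above is that a value containing any other character (a hyphen, an apostrophe, a plus sign) is skipped entirely, because the lookahead then never sees “<” right after the run. The sketch below illustrates this with made-up sample values that are not from the question, and shows one possible alternative: deleting the tags with preg_replace and keeping the non-empty lines. It assumes one value per line and no “>” inside attribute values.

```php
<?php
// Hypothetical sample values chosen to contain characters outside [\w .@].
$str = '<div class="name"><td>Anne-Marie O\'Neil</td></div>
<div class="email"><td>a.oneil+test@example.com</td></div>';

// Original pattern: neither value matches, because of "-", "'" and "+".
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $str, $strict);

// Alternative: remove every tag, then keep the trimmed, non-empty lines.
$text  = preg_replace('/<[^>]*>/', '', $str);
$lines = array_values(array_filter(array_map('trim', explode("\n", $text)), 'strlen'));

var_dump($strict[1]); // empty array
print_r($lines);      // the two values, tags removed
```

Widening the character class (for example adding - and ' to it) is the smaller change if the set of allowed characters is known in advance. The tag-stripping variant makes no assumption about the characters in each value but relies on the line layout.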