Question

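I need to strip the div and td tags so that I can extract the content between them and insert it into a database. However, due to some constraints, I have to use regular expressions rather than XPath or DOMDocument to extract the content. Any help is appreciated. Thanks!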

 <tr class = "student_information" >
            <div class="admin"><td>141234U</td></div>
            <div class="name"><td>Tan Ping Ping</td></div>
            <div class="hp"><td>82222222</td></div>
            <div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
        </tr>
                    <tr class = "student_information" >
            <div class="admin"><td>132458Q</td></div>
            <div class="name"><td>Tan Rui</td></div>
            <div class="hp"><td>86339557</td></div>
            <div class="email"><td>132458Q@hotmail.com</td></div>

 Output: 

 141234U
 Tan Ping Ping
 82222222
 141234U@mymail.nyp.edu.sg

 132458Q
 Tan Rui
 86339557
 132458Q@hotmail.com

Answer 1

"However, due to some constraints, I have to use regular expressions rather than XPath or DOMDocument to extract the content"

Based on the above, you can use this regex: (?<=>)([\w .@]+)(?=<), i.e.:

$str = <<< EOF
 <tr class = "student_information" >
            <div class="admin"><td>141234U</td></div>
            <div class="name"><td>Tan Ping Ping</td></div>
            <div class="hp"><td>82222222</td></div>
            <div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
        </tr>
                    <tr class = "student_information" >
            <div class="admin"><td>132458Q</td></div>
            <div class="name"><td>Tan Rui</td></div>
            <div class="hp"><td>86339557</td></div>
            <div class="email"><td>132458Q@hotmail.com</td></div>
EOF;

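// Capture every run of word characters, spaces, "." or "@" that sits directly between ">" and "<".
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $str, $result, PREG_PATTERN_ORDER);
// $result[1] holds the text captured by group 1 for each match.
foreach ($result[1] as $match) {
    echo $match . "\n";
}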

Output:

141234U
Tan Ping Ping
82222222
141234U@mymail.nyp.edu.sg
132458Q
Tan Rui
86339557
132458Q@hotmail.com
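
Since the goal in the question is to insert the values into a database, the flat list above can be grouped into one record per student. The following is only a minimal sketch: it assumes every row yields exactly four values in the order admin, name, hp, email (the div class names from the sample), and the `$fields` and `$students` names are made up for illustration.

```php
<?php
// Sample input copied from the question (nowdoc, so nothing is interpolated).
$str = <<<'EOF'
 <tr class = "student_information" >
            <div class="admin"><td>141234U</td></div>
            <div class="name"><td>Tan Ping Ping</td></div>
            <div class="hp"><td>82222222</td></div>
            <div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
        </tr>
                    <tr class = "student_information" >
            <div class="admin"><td>132458Q</td></div>
            <div class="name"><td>Tan Rui</td></div>
            <div class="hp"><td>86339557</td></div>
            <div class="email"><td>132458Q@hotmail.com</td></div>
EOF;

// Same pattern as in the answer.
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $str, $result);

// Assumption: each student contributes exactly these four fields, in this order.
$fields = ['admin', 'name', 'hp', 'email'];

$students = [];
foreach (array_chunk($result[1], count($fields)) as $chunk) {
    // Skip an incomplete trailing group instead of building a broken record.
    if (count($chunk) === count($fields)) {
        $students[] = array_combine($fields, $chunk);
    }
}

print_r($students);
```

Each element of `$students` is then an associative array that could be bound to a prepared INSERT statement. If a value fails to match the pattern, the grouping shifts, so checking the total count before inserting is probably worthwhile.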

Regex explanation:

(?<=>)([\w .@]+)(?=<)

Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=>)»
   Match the character “>” literally «>»
Match the regex below and capture its match into backreference number 1 «([\w .@]+)»
   Match a single character present in the list below «[\w .@]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A “word character” (letter, digit, or underscore; ASCII only by default in PCRE) «\w»
      A single character from the list “ .@” (space, period, at sign) « .@»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=<)»
   Match the character “<” literally «<»
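
One consequence of the character class described above is that a value containing any other character (an apostrophe or hyphen in a name, for example) is not matched at all. The sketch below illustrates this with a hypothetical row and compares it with a looser pattern that captures everything inside a `<td>` cell. The looser pattern is an assumption added for illustration, not part of the original answer.

```php
<?php
// Hypothetical row: the name contains an apostrophe and a hyphen.
$row = "<td>141234U</td>\n<td>O'Brien-Tan</td>\n<td>82222222</td>";

// Pattern from the answer: only word characters, space, "." and "@" are allowed.
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $row, $strict);

// Assumed alternative: capture anything up to the next "<" inside a <td> cell,
// ignoring leading and trailing whitespace.
preg_match_all('/<td>\s*([^<]+?)\s*<\/td>/', $row, $loose);

print_r($strict[1]); // the name is silently dropped
print_r($loose[1]);  // all three cells are captured
```

Which pattern fits better depends on how tightly the input can be trusted. The strict class rejects unexpected characters, while the looser one accepts any cell text.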

How to remove <tags> using regex in PHP
