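I need to strip the div and td tags so that I can extract the content between them and insert it into a database. Because of some constraints, I have to use a regular expression rather than XPath or DOMDocument to extract the content. Any help is appreciated. Thanks!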
<tr class = "student_information" >
<div class="admin"><td>141234U</td></div>
<div class="name"><td>Tan Ping Ping</td></div>
<div class="hp"><td>82222222</td></div>
<div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
</tr>
<tr class = "student_information" >
<div class="admin"><td>132458Q</td></div>
<div class="name"><td>Tan Rui</td></div>
<div class="hp"><td>86339557</td></div>
<div class="email"><td>132458Q@hotmail.com</td></div>
Output:
141234U
Tan Ping Ping
82222222
141234U@mymail.nyp.edu.sg
132458Q
Tan Rui
86339557
132458Q@hotmail.com
Answer 0 (score: 0)
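"Because of some constraints, I have to use a regular expression rather than XPath or DOMDocument to extract the content"
Given that constraint, you can use this regex: (?<=>)([\w .@]+)(?=<)
For example: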
$str = <<< EOF
<tr class = "student_information" >
<div class="admin"><td>141234U</td></div>
<div class="name"><td>Tan Ping Ping</td></div>
<div class="hp"><td>82222222</td></div>
<div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
</tr>
<tr class = "student_information" >
<div class="admin"><td>132458Q</td></div>
<div class="name"><td>Tan Rui</td></div>
<div class="hp"><td>86339557</td></div>
<div class="email"><td>132458Q@hotmail.com</td></div>
EOF;
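// Collect every run of word characters, spaces, dots and @ signs that sits directly between ">" and "<".
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $str, $result, PREG_PATTERN_ORDER);
// $result[1] holds the text captured by group 1, in document order.
foreach ($result[1] as $match) {
    echo $match . "\n";
}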
Output:
141234U
Tan Ping Ping
82222222
141234U@mymail.nyp.edu.sg
132458Q
Tan Rui
86339557
132458Q@hotmail.com
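Since the goal is to insert the values into a database, the flat list of matches usually has to be grouped back into one record per student. The following is only a sketch under stated assumptions: the field names `admin`, `name`, `hp` and `email` are taken from the div class names in the sample, and every row is assumed to contain exactly those four cells in that order. The `$fields` and `$students` variables are illustrative names, not part of the original answer.

```php
<?php
// Sample markup in the same shape as the question's input.
$str = <<<EOF
<tr class = "student_information" >
<div class="admin"><td>141234U</td></div>
<div class="name"><td>Tan Ping Ping</td></div>
<div class="hp"><td>82222222</td></div>
<div class="email"><td>141234U@mymail.nyp.edu.sg</td></div>
</tr>
<tr class = "student_information" >
<div class="admin"><td>132458Q</td></div>
<div class="name"><td>Tan Rui</td></div>
<div class="hp"><td>86339557</td></div>
<div class="email"><td>132458Q@hotmail.com</td></div>
EOF;

// Same pattern as in the answer: text directly between ">" and "<".
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $str, $result, PREG_PATTERN_ORDER);

// Assumed column order, taken from the div class names in the sample.
$fields = ['admin', 'name', 'hp', 'email'];

// Split the flat list into groups of four and label each value.
$students = [];
foreach (array_chunk($result[1], count($fields)) as $chunk) {
    // Skip an incomplete trailing group instead of building a broken record.
    if (count($chunk) === count($fields)) {
        $students[] = array_combine($fields, $chunk);
    }
}

print_r($students);
```

Each element of `$students` is then an associative array that can be bound to a prepared statement. If a row is missing a cell, the grouping shifts for every following row, so checking the total match count before inserting is advisable.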
Regex explanation:
(?<=>)([\w .@]+)(?=<)
Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=>)»
  Match the character “>” literally «>»
Match the regex below and capture its match into backreference number 1 «([\w .@]+)»
  Match a single character present in the list below «[\w .@]+»
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    A “word character” (letter, digit, or underscore) «\w»
    A single character from the list “ .@” (space, dot, at sign) « .@»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=<)»
  Match the character “<” literally «<»
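One consequence of the character class `[\w .@]` is that any cell containing another character, such as a hyphen or an apostrophe, is skipped entirely, because the run of allowed characters no longer reaches the closing `<`. The sketch below contrasts that pattern with a looser alternative that captures whatever sits between `<td>` and `</td>`. This alternative is a different technique from the one in the answer, and the sample names and addresses are made up for illustration.

```php
<?php
// Hypothetical cells containing characters outside [\w .@].
$row = '<div class="name"><td>Mary-Anne O\'Neil</td></div>'
     . '<div class="email"><td>mary-anne@example.com</td></div>';

// Pattern from the answer: only word characters, spaces, dots and @ signs.
preg_match_all('/(?<=>)([\w .@]+)(?=<)/', $row, $strict);

// Looser alternative: anything between <td> and </td>, matched lazily.
// The "s" modifier lets "." also match newlines inside a cell.
preg_match_all('/<td>(.*?)<\/td>/s', $row, $loose);

print_r($strict[1]); // empty: both cells contain a hyphen
print_r($loose[1]);  // both cell values
```

The looser pattern depends on the values always being wrapped in plain `<td>` tags without attributes, which holds for the sample in the question but may not hold for other markup.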