This is the HTML:
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
The output I want is:
+88 01756567676
frank.wade@email.com
NAC739
I am using simple_html_dom to parse the data.
This is the code I wrote. It works if the contact information section is wrapped in a paragraph tag (<p>).
// Find the <strong> whose text starts with "Phone", then step up to its parent element
$contact = $facultyData->find('strong[plaintext^=Phone]');
$contact = $contact[0]->parent();
// Split the parent's text into lines; this expects Phone, Email and Office on lines 0, 1 and 2
$element = explode("\n", strip_tags($contact->plaintext));
if (preg_match('/Phone:(.*)/', $element[0], $match)) {
    $phone = trim($match[1]);
}
if (preg_match('/Email:(.*)/', $element[1], $match)) {
    $email = trim($match[1]);
}
if (preg_match('/Office:(.*)/', $element[2], $match)) {
    $office = trim($match[1]);
}
Is there any way to get these 3 lines by matching on the tags?
Answer 0: (score: 1)
Perhaps you could use the xpath function:
$xml = new SimpleXMLElement($DomAsString);
$theText = $xml->xpath('//strong[. ="Phone"]/following-sibling::text()');
This still needs a little trimming to remove the ':', and of course the DOM structure has to be fixed first.
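SimpleXMLElement only accepts well-formed XML, and the fragment in the question has unclosed <br> tags. As a hedged alternative, here is a minimal sketch of the same following-sibling idea using PHP's DOMDocument and DOMXPath, which tolerate that markup. This swaps in a different parser from the one the answer uses. The helper name extractContactFields is hypothetical, the sample is the question's fragment wrapped in <table><tr> so it forms valid HTML, and the labels are assumed to be plain words without quotes.

```php
<?php
// Sample input: the question's fragment, wrapped in <table><tr> (an assumption) so it is valid HTML.
$html = <<<HTML
<table><tr><td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td></tr></table>
HTML;

// Hypothetical helper: returns the text that follows each <strong>Label</strong>.
function extractContactFields(string $html, array $labels): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $fields = [];
    foreach ($labels as $label) {
        // First text node that follows the matching <strong> inside the same parent.
        $query = '//strong[normalize-space(.)="' . $label . '"]/following-sibling::text()[1]';
        $nodes = $xpath->query($query);
        if ($nodes !== false && $nodes->length > 0) {
            // Strip the leading ':' and surrounding whitespace.
            $fields[$label] = trim($nodes->item(0)->nodeValue, " :\t\r\n");
        }
    }
    return $fields;
}

$fields = extractContactFields($html, ['Phone', 'Email', 'Office']);
print_r($fields);
```

Because the lookup is keyed on the label text rather than on line positions, it does not depend on whether the contact block is wrapped in a paragraph tag.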
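Answer 1: (score: 0)
You don't actually need to parse this as HTML or work with a DOM tree. You can split the HTML string into small pieces, then strip the extra text from each piece to get what you want: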
<?php
$str = <<<str
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
str;
// We explode $str and use '</strong>' as delimiter and get only the part of result that we need
$lines = array_slice(explode('</strong>', $str), 3, 3);
// Define a function to remove extra text from left and right of our so called lines
function stripLine($line) {
    // ltrim leading ' :' characters and remove everything after (and including) '<br>'
    return preg_replace('/<br>.*/is', '', ltrim($line, ' :'));
}
$lines = array_map('stripLine', $lines);
print_r($lines);
See the code output here.
Answer 2: (score: 0)
Or just use a regular expression:
// Capture the text after "Phone</strong>: " up to the next tag
preg_match('|Phone</strong>: ([^<]+)|', $str, $m) or die('no phone');
$phone = $m[1];
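The same pattern can be extended to all three fields in one pass. The following is a rough sketch, assuming each label sits in its own <strong> tag followed by ':' as in the question; the helper name matchContactFields and the shortened sample string are hypothetical.

```php
<?php
// Shortened sample: only the three contact lines from the question's HTML.
$sample = '<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>';

// Hypothetical helper: maps each label to the text between ':' and the next tag.
function matchContactFields(string $html): array
{
    // '~' is the delimiter so that '|' stays free for alternation.
    $pattern = '~<strong>(Phone|Email|Office)</strong>:\s*([^<]+)~';
    $contact = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // $m[1] is the label, $m[2] the value that follows it.
            $contact[$m[1]] = trim($m[2]);
        }
    }
    return $contact;
}

$contact = matchContactFields($sample);
print_r($contact);
```

A missing label is simply absent from the returned array, so callers can check with isset() instead of relying on die().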