How to get the string that follows a specific HTML DOM element

Date: 2018-08-10 10:31:00

Tags: php dom web-crawler simple-html-dom

Here is the HTML:

<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br> 
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>

The output I want is:

+88 01756567676

frank.wade@email.com

NAC739

I am using simple_html_dom to parse the data.

Here is the code I wrote. It works only if the contact information section is wrapped in a paragraph tag.

$contact = $facultyData->find('strong[plaintext^=Phone]');
$contact = $contact[0]->parent();
$element = explode("\n", strip_tags($contact->plaintext));

$regex = '/Phone:(.*)/';
if (preg_match($regex, $element[0], $match)) 
    $phone = $match[1];

$regex = '/Email:(.*)/';
if (preg_match($regex, $element[1], $match)) 
    $email = $match[1];

$regex = '/Office:(.*)/';
if (preg_match($regex, $element[2], $match)) 
    $office = $match[1];

Is there a way to get these 3 lines by matching on the tags?

3 Answers:

Answer 0: (score: 1)

Maybe you could use the xpath function:

$xml = new SimpleXMLElement($DomAsString);
$theText = $xml->xpath('//strong[. ="Phone"]/following-sibling::text()');

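You will still need to trim the leading ':' from the result, and of course the DOM structure has to be fixed first: SimpleXMLElement only accepts well-formed XML, and the unclosed <br> tags in the posted HTML are not.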
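
A minimal sketch of the same XPath idea that avoids repairing the markup by hand. It swaps SimpleXMLElement for PHP's bundled DOM extension (DOMDocument and DOMXPath), because DOMDocument::loadHTML() tolerates unclosed tags such as <br>. The variable names $html, $contact and $label are illustrative and not part of the original answer.

```php
<?php
// HTML exactly as posted in the question.
$html = <<<HTML
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
HTML;

// Collect parser warnings internally instead of printing them;
// the markup is not valid HTML, so libxml will complain.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$contact = [];
foreach (['Phone', 'Email', 'Office'] as $label) {
    // First text node that directly follows <strong>Label</strong>.
    $nodes = $xpath->query(
        '//strong[. = "' . $label . '"]/following-sibling::text()[1]'
    );
    if ($nodes->length > 0) {
        // Strip the leading ':' and surrounding whitespace.
        $contact[$label] = trim($nodes->item(0)->nodeValue, ": \t\r\n");
    }
}

print_r($contact);
```

Looking each label up separately keeps the XPath expression simple, and a label that is missing from the page is simply left out of $contact instead of causing an error.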

Answer 1: (score: 0)

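You don't actually need to parse this as HTML or walk the DOM tree. You can split the HTML string into pieces and then strip the surplus text from each piece to get what you need: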

<?php 

$str = <<<str
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
str;

// We explode $str and use '</strong>' as delimiter and get only the part of result that we need
$lines = array_slice(explode('</strong>', $str), 3, 3);
// Define a function to remove extra text from left and right of our so called lines
function stripLine($line) {
    // ltrim ' :' characters and remove everything after (and including) '<br>'
    return preg_replace('/<br>.*/is', '', ltrim($line, ' :'));
}
$lines = array_map('stripLine', $lines);

print_r($lines);

See the code output here.

Answer 2: (score: 0)

Or just use a regular expression:

// The value must be wrapped in a capture group, otherwise $m[1] is never set
preg_match('|Phone</strong>: ([^<]+)|', $str, $m) or die('no phone');
$phone = $m[1];
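
The same idea can be extended to fetch all three fields in a single pass. The following is only a sketch, assuming every field has the shape "<strong>Label</strong>: value" and that the value contains no '<'. The array name $fields is illustrative.

```php
<?php
// HTML exactly as posted in the question.
$str = <<<HTML
<td width="551">
<p><strong>Full Time Faculty<br>
<strong></strong>Assistant Professor</strong></p>Doctorate of Business Administration<br><br>
<strong>Phone</strong>: +88 01756567676<br>
<strong>Email</strong>: frank.wade@email.com<br>
<strong>Office</strong>: NAC739<br>
<br><p><b>Curriculum Vitae</b></p></td>
HTML;

$fields = [];
// Group 1 captures the label, group 2 the text up to the next tag.
$pattern = '~<strong>(Phone|Email|Office)</strong>:\s*([^<]+)~';
if (preg_match_all($pattern, $str, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $fields[$m[1]] = trim($m[2]);
    }
}

print_r($fields);
```

Using '~' as the delimiter leaves '|' free for the alternation between labels. A regular expression like this is brittle if the markup changes (for example, attributes on the <strong> tag), so it suits a fixed page layout better than arbitrary HTML.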