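I'm trying to extract data from a web page in order to insert it into a database. The data I'm interested in is inside divs with class="company". There are 15 or fewer such divs on a page, and there are many pages I want to extract this data from. For that reason I'm looking for an automated way to extract the data.
A div with class="company" looks like this (15 or fewer of these per page, each with different data):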
<div class="company" id="company-6666"> <!-- EXTRACT 'company-6666' from id="company-6666" -->
<div class="top clearfix">
<div class="name clearfix">
<h2>
<a href="/company-name">Company Name</a> <!-- EXTRACT 'Company Name' from contents of A element and EXTRACT '/company-name' from href attribute -->
<a href="/branches-list-link?parent_id=6666" class="branches">Branches <span>(5)</span></a> <!-- EXTRACT '/branches-list-link?parent_id=6666' from href attribute -->
</h2>
</div>
</div>
<div class="inner clearfix has-logo">
<div class="clearfix">
<div class="logo">
<a href="/company-name">
<img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" border="0" alt="" /> <!-- EXTRACT '/graphics/company/logo/listing/123456.jpg?_ts=1365390237' from src attribute -->
</a>
</div>
<div class="info">
<div class="address">StreetName 500, 7777 City, County</div> <!-- EXTRACT 'StreetName 500, 7777 City, County' from contents of class="address" div -->
<div class="clearfix">
<div class="slogan">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.</div> <!-- EXTRACT 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi ac condimentum mi.' from contents of class="slogan" div -->
</div>
</div>
</div>
<div class="actions-bar clearfix">
<ul>
<li><span class="phone-number">6666666</span></li> <!-- EXTRACT '6666666' from contents of class="phone-number" span -->
<li><a href="mailto:mail@mail.com" target="_blank" title="mail@mail.com" class="email">mail@mail.com</a></li> <!-- EXTRACT 'mail@mail.com' from contents of class="email" link -->
<li><a href="http://www.webpage.com" target="_blank" title="www.webpage.com" class="redirect url">www.webpage.com</a></li> <!-- EXTRACT 'www.webpage.com' from contents of class="redirect url" link -->
</ul>
</div>
</div>
</div>
So far I have the following PHP code ($output holds the HTML of the web page):
<?php
$doc = new DomDocument();
@$doc->loadHTML($output);
$doc->preserveWhiteSpace = false;
$xpath = new DomXPath($doc);
$elements = $xpath->query("//*[@class='company']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo $element->nodeValue;
}
}
?>
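It seems to get all 15 divs with class="company", but I don't know how to extract the individual values mentioned above (in the comments of the HTML code).
Not every div with class="company" contains all of the values shown in the HTML block. So for each element I'm interested in, I have to check whether it exists inside the company div and, if it does, whether it is empty (has no text between its tags). If it exists and is not empty, I add it to a variable.
Once the values are extracted, I'd like to assign them to PHP variables so I can work with them later. It would be even better if the extracted values were placed in an array: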
$result = array(
    // 1st div's data
    0 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches' => 'branches link',
        'company logo' => 'logo',
        'company address' => 'address',
        'company slogan' => 'slogan',
        'company webpage' => 'webpage',
        'company email' => 'email',
        'company phone' => 'phone'
    ),
    // 2nd div's data
    1 => array(
        'company name' => 'company name',
        'company link' => 'company link',
        'company id' => 'company id',
        'company branches' => 'branches link',
        'company logo' => 'logo',
        'company address' => 'address',
        'company slogan' => 'slogan',
        'company webpage' => 'webpage',
        'company email' => 'email',
        'company phone' => 'phone'
    ),
    // ...
);
Answer 0 (score: 2)
Each Company can be represented by a context node, with each of its properties represented by an XPath expression relative to that node:
Company company-6666:
->id ....... = "company-6666" -- string(@id)
->name ..... = "Company Name" -- .//a[1]/text()
->href ..... = "/company-name" -- .//a[1]/@href
->img ...... = "/graphics/company/logo/listing/123456.jpg?_ts=1365390237" -- .//img[1]/@src
->address .. = "StreetName 500, 7777 City, County" -- .//*[@class="address"]/text()
...
This works very well if you wrap it in objects:
$doc = new DOMDocument();
$doc->loadHTML($html);
/* @var $companies DOMValueObject[] */
$companies = new Companies($doc);
foreach ($companies as $company) {
printf("Company %s:\n", $company->id);
foreach ($company->getObjectProperties() as $name => $value) {
$expression = $company->getPropertyExpression($name);
printf(" ->%'.-10s = \"%s\" -- %s\n", $name.' ', $value, $expression);
}
}
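This builds on the helper classes DOMObjectCollection and DOMValueObject (custom classes, not part of PHP's DOM extension), on top of which you define your own type: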
class Companies extends DOMValueCollection
{
public function __construct(DOMDocument $doc) {
parent::__construct($doc, '//*[@class="company"]');
}
/**
* @return DOMValueObject
*/
public function current() {
$object = parent::current();
$object->defineProperty('id', 'string(@id)');
$object->defineProperty('name', './/a[1]/text()');
$object->defineProperty('href', './/a[1]/@href');
$object->defineProperty('img', './/img[1]/@src');
$object->defineProperty('address', './/*[@class="address"]/text()');
# ... add your definitions
return $object;
}
}
For your array requirement, there is the getArrayCopy() method:
echo "\nGet Array Copy:\n\n";
print_r($companies->getArrayCopy());
Output:
Get Array Copy:
Array
(
[0] => Array
(
[id] => company-6666
[name] => Company Name
[href] => /company-name
[img] => /graphics/company/logo/listing/123456.jpg?_ts=1365390237
[address] => StreetName 500, 7777 City, County
)
)
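If the helper classes are not at hand, the same context-node idea can be sketched with plain DOMXPath. This is only a minimal illustration, not the code behind the classes above: the $map array and the reduced sample markup are assumptions made for the example, and string() / normalize-space() are used so that a missing element simply yields an empty string.

```php
<?php
// Sample markup reduced from the question; a real page would hold up to 15 such divs.
$html = <<<'HTML'
<div class="company" id="company-6666">
  <h2><a href="/company-name">Company Name</a></h2>
  <div class="logo"><a href="/company-name"><img src="/graphics/company/logo/listing/123456.jpg?_ts=1365390237" alt="" /></a></div>
  <div class="address">StreetName 500, 7777 City, County</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Property name => XPath expression relative to one company node (assumed map).
// string() and normalize-space() return '' when nothing matches.
$map = [
    'id'      => 'string(@id)',
    'name'    => 'normalize-space(.//a[1])',
    'href'    => 'string(.//a[1]/@href)',
    'img'     => 'string(.//img[1]/@src)',
    'address' => 'normalize-space(.//*[@class="address"])',
];

$result = [];
foreach ($xpath->query('//*[@class="company"]') as $company) {
    $row = [];
    foreach ($map as $name => $expression) {
        // The second argument makes $company the context node for the expression.
        $row[$name] = $xpath->evaluate($expression, $company);
    }
    $result[] = $row;
}

print_r($result);
```

Passing the company node as the second argument to evaluate() is what keeps each expression scoped to one company; without it, './/a[1]' would be resolved against the whole document.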
Answer 1 (score: 1)
To check whether a node exists, verify that the length property of the returned query result equals 1:
if ($company_name->length == 1) {
$object->company_name = trim($company_name->item(0)->nodeValue);
}
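Put together with the "exists and is not empty" requirement from the question, the length check might look roughly like the sketch below. The helper name extractText(), the field list and the trimmed-down sample markup are assumptions for illustration; the nullable return type needs PHP 7.1 or later.

```php
<?php
/**
 * Return the trimmed text of the node matched by $expression, or null when
 * there is not exactly one match or the match is empty. (Hypothetical helper.)
 */
function extractText(DOMXPath $xpath, string $expression, DOMNode $context): ?string
{
    $nodes = $xpath->query($expression, $context);
    if ($nodes === false || $nodes->length !== 1) {
        return null;                       // invalid expression, or node absent/ambiguous
    }
    $text = trim($nodes->item(0)->nodeValue);
    return $text === '' ? null : $text;    // treat empty elements as absent
}

// Sample: phone present, slogan empty, email missing.
$html = <<<'HTML'
<div class="company" id="company-6666">
  <div class="slogan">   </div>
  <ul><li><span class="phone-number"> 6666666 </span></li></ul>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Array key => XPath expression relative to the company div (assumed field list).
$fields = [
    'company phone'  => './/*[@class="phone-number"]',
    'company email'  => './/a[@class="email"]',
    'company slogan' => './/*[@class="slogan"]',
];

$result = [];
foreach ($xpath->query('//div[@class="company"]') as $company) {
    $row = ['company id' => $company->getAttribute('id')];
    foreach ($fields as $key => $expression) {
        $value = extractText($xpath, $expression, $company);
        if ($value !== null) {
            $row[$key] = $value;   // keep only values that exist and are non-empty
        }
    }
    $result[] = $row;
}

print_r($result);
```

Here the empty slogan and the missing email are simply left out of the row; storing null under those keys instead would be an equally reasonable choice if every row should have the same keys.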