Scraping inconsistent search results with PHP

Date: 2017-04-30 08:40:11

Tags: php curl xpath web-scraping scrape

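How can I scrape a list of search results whose items are inconsistent?

Here is an example:

This search result lists 4 businesses: https://www.11880.com/suche/0521441422/deutschland

Not all 4 of these businesses include opening hours: the first one does not, the last 3 do.

So when I try to scrape the list with the script below, the opening hours end up with the wrong businesses: they get "connected" to the first 3 businesses instead of the last 3.

How can I modify the script so that the opening hours are associated with the correct business?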

<?php

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20120101 Firefox/33.0');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, 'https://www.11880.com/suche/0521441422/deutschland');
$page = curl_exec($ch);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$results = [];
$results['name'] = $xpath->query('//h2[@itemprop="name"]');
$results['street'] = $xpath->query('//span[@class="street-address"]');
$results['zipcode'] = $xpath->query('//span[@class="postal-code"]');
$results['city'] = $xpath->query('//span[@class="address-locality"]');
$results['district'] = $xpath->query('//span[@class="quarter"]');
$results['opening_hours'] = $xpath->query('//span[@class="open-or-closed"]');


// XPath and CSS selector of one opening-hours element (third result), kept for reference:
//*[@id="html-search-result-list"]/li[3]/div/div[3]/div[1]/span[1]
#html-search-result-list > li:nth-child(3) > div > div.row-result-entry--bottom.row > div.col-result-entry-content--contactinfos.hidden-xs.col-sm-8 > span.btn-ghost.btn-ghost-primary.btn-result-entry-interaction.open-or-closed.open

for($x=0; $x < $results['name']->length;$x++)
{
  echo trim($results['name']->item($x)->textContent) . ";";
  echo trim($results['street']->item($x)->textContent) . ";";
  echo trim($results['zipcode']->item($x)->textContent) . ";";
  echo trim($results['city']->item($x)->textContent) . ";";
  echo trim($results['district']->item($x)->textContent) . ";";
  echo trim($results['opening_hours']->item($x)->textContent) . "<br>\n";
}

?>
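
The mismatch can be reproduced without the live site. The following is a minimal sketch on made-up, simplified markup (the `<ul>`/`<li>` layout and the names A, B, C are assumptions for illustration, not the real page): two document-wide queries return lists of different lengths, so the same index no longer refers to the same business.

```php
<?php
// Hypothetical, simplified markup: the first business has no opening hours.
$html = '<ul id="html-search-result-list">
  <li><h2 itemprop="name">A</h2></li>
  <li><h2 itemprop="name">B</h2><span class="open-or-closed">open</span></li>
  <li><h2 itemprop="name">C</h2><span class="open-or-closed">closed</span></li>
</ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Both queries search the whole document, independently of each other.
$names = $xpath->query('//h2[@itemprop="name"]');
$hours = $xpath->query('//span[@class="open-or-closed"]');

$nameCount = $names->length;
$hourCount = $hours->length;
$firstName = trim($names->item(0)->textContent);
// Index 0 of $hours is the first span in the document, which sits under B, not A.
$firstHours = trim($hours->item(0)->textContent);

echo "$nameCount names, $hourCount opening-hours entries\n";
echo "$firstName paired with: $firstHours\n";
```

Here the loop index only lines up as long as every business has every field; one missing field shifts all later values of that field up by one position.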

1 Answer:

Answer 0 (score: 1)

You could do it like this. This is only a draft.

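// Find the nearest enclosing div of each business name
$divs = $xpath->query('//h2[@itemprop="name"]/ancestor::div[1]');
for ($x = 0; $x < $divs->length; $x++) {
   // Query relative to this div (note the leading "."), so matches stay with this business
   $name = $xpath->query('.//h2[@itemprop="name"]', $divs->item($x));
   // query() returns a DOMNodeList even when nothing matches, so check its length too
   if ($name && $name->length > 0) {
      echo trim($name->item(0)->textContent) . ";";
   }
// ...
}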
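
A fuller sketch of the same idea, under stated assumptions: the XPath quoted in the question suggests each result is an `li` under `#html-search-result-list`, so that element is used as the per-business container here. The markup, the placeholder values, and the helper `firstText()` are hypothetical and only serve to make the example self-contained; they have not been checked against the real page.

```php
<?php
// Hypothetical helper: text of the first node matching $query inside $context,
// or an empty string when nothing matches.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): string
{
    $nodes = $xpath->query($query, $context);
    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : '';
}

// Simplified stand-in for the fetched page (assumed structure, placeholder values).
$page = '<ul id="html-search-result-list">
  <li><h2 itemprop="name">A</h2><span class="postal-code">11111</span></li>
  <li><h2 itemprop="name">B</h2><span class="postal-code">22222</span>
      <span class="btn-ghost open-or-closed open">open</span></li>
</ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$rows = [];
// One iteration per result item; every inner query starts with "." so it is relative to $item.
foreach ($xpath->query('//*[@id="html-search-result-list"]/li') as $item) {
    $rows[] = [
        'name'          => firstText($xpath, './/h2[@itemprop="name"]', $item),
        'zipcode'       => firstText($xpath, './/span[@class="postal-code"]', $item),
        // The span may carry several classes, so match "open-or-closed" as one class token.
        'opening_hours' => firstText(
            $xpath,
            './/span[contains(concat(" ", normalize-space(@class), " "), " open-or-closed ")]',
            $item
        ),
    ];
}

foreach ($rows as $row) {
    echo implode(';', $row) . "\n";
}
```

Because a missing field becomes an empty string for that business only, the remaining businesses keep their own values. The other fields from the question (`street-address`, `address-locality`, `quarter`) could be added to the array in the same way.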