Web scraping a <div> tag and the content inside it

Time: 2016-01-31 16:44:58

Tags: web screen-scraping

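I need to scrape data from http://www.hegnar.no/netfonds/aksjekurser/. Specifically, I want to extract the data from the table on that page, but the table's markup is written inside div tags. I tried a PHP regex together with file_get_contents, but I could not get it to work.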

<?php

$html = file_get_contents("http://www.hegnar.no/netfonds/aksjekurser");


preg_match_all(
            '<tr>
<td class="left"><a href=".*?">(.*?)<\/a><\/td>.*?
<td class="left">(.*?)<\/td>.*?
<td name=".*?">(.*?)<\/td>.*?
<td name=".*?">(.*?)<\/td>.*?
<td>(.*?)<\/td>.*?
<td class="up" name=".*?">(.*?)<\/td>.*?
<td class="up" name=".*?">(.*?)<\/td>.*?
<td>(.*?)<\/td>.*?
<td>(>*?)<\/td>.*?
<td>(.*?)<\/td>.*?
<td>(.*?)<\/td>.*?
<td name=".*?">(.*?)<\/td>
<td name=".*?">(.*?)<\/td><\/tr>/s',


$html,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
$selskap = $post[1];
$ticket = $post[2];
$siste = $post[3];
$kejop = $post[4];
$slag = $post[5];
$ending = $post[6];
$ending2 = $post[7];
$apring = $post[8];
$lav = $post[9];
$hoy = $post[10];
$forrige = $post[11];
$volume = $post[12];
$ratio = $post[13];



echo "$selskap</br>";
echo "$ticket</br>";
echo "$siste</br>";
echo "$kejop</br>";
echo "$slag</br>";
echo "$ending</br>";
echo "$ending2</br>";
echo "$apring</br>";
echo "$lav</br>";
echo "$hoy</br>";
echo "$forrige</br>";
echo "$volume</br>";
echo "$ratio</br>";


}

echo "<p>" . count($posts) . " posts found</p>";

2 answers:

Answer 0: (score: 1)

You can use this library: PHP Simple HTML DOM Parser

See also this question: Extract Information from HTML
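As a rough illustration of the parser-based approach, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath classes instead of the library named above (a swapped-in technique, chosen only so the example needs no extra download). The sample markup, the `extractRows` helper, and the cell values are all hypothetical; the real structure of the hegnar.no table is not known here and may need a different XPath query.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = <<<HTML
<div id="quotes">
<table>
<tr><td class="left"><a href="/a">Company A</a></td><td class="left">AAA</td><td name="last">10.50</td></tr>
<tr><td class="left"><a href="/b">Company B</a></td><td class="left">BBB</td><td name="last">20.25</td></tr>
</table>
</div>
HTML;

// Hypothetical helper: returns one array of trimmed cell texts per table row.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Select every <tr> that has at least one <td> child (skips header rows).
    foreach ($xpath->query('//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$rows = extractRows($html);
foreach ($rows as $cells) {
    echo implode(' | ', $cells), "\n";
}
```

For the live page, the string passed to `extractRows` would presumably come from file_get_contents, as in the question. A parser tolerates attribute order, whitespace, and extra markup that would break a long regular expression.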

Answer 1: (score: 0)

There is at least one typo in your regex:

<td>(>*?)<\/td>.*?

This was probably meant to be:

<td>(.*?)<\/td>.*?
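Beyond that typo, the pattern in the question also appears to lack an opening delimiter: it begins with `<tr>` but ends with `/s`, so PHP would likely treat `<` and `>` as the delimiter pair rather than `/`. The following is a minimal sketch of a well-delimited version on a shortened, hypothetical row with only four cells; the sample markup is an assumption, and the real page would need one capture group per column.

```php
<?php
// Hypothetical sample row; the real page's markup may differ.
$html = <<<HTML
<tr>
<td class="left"><a href="/a">Company A</a></td>
<td class="left">AAA</td>
<td>10.50</td>
<td>9.80</td>
</tr>
HTML;

// '~' is used as the delimiter, so '/' in closing tags needs no escaping.
// The 's' modifier lets '.' match newlines as well.
$pattern = '~<tr>\s*'
    . '<td class="left"><a href=".*?">(.*?)</a></td>\s*'
    . '<td class="left">(.*?)</td>\s*'
    . '<td>(.*?)</td>\s*'
    . '<td>(.*?)</td>\s*'
    . '</tr>~s';

// PREG_SET_ORDER groups the captures per matched row.
$count = preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

foreach ($posts as $post) {
    echo "{$post[1]} ({$post[2]}): {$post[3]} / {$post[4]}\n";
}
```

Even with these fixes, a pattern like this stays tied to the exact attribute order and whitespace of the page, so any markup change can make it match nothing.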