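I need to get the lessons from an online timetable (for school) into an array, so that I can insert the data into my database. The online timetable (URL: roosters-hd.stenden.com) looks like this:
On the left-hand side we see the times, and along the top the days (Mo, Tu, We, Th, Fr). Pretty basic.
Besides the lesson details, I also need to get [StartDate] and [EndDate]. The time depends on the row the lesson cell is in and on how many rows it spans (its rowspan). The date can be calculated by adding the column number to the start date (printed at the top). So in the end the array should look like this: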
[0] => Array
(
[0] => Array
(
[Name] => Financiering
[Type] => WC
[Code] => DECBE3
[Classroom] => E2.053 - leslokaal
[Teacher] => Verboeket, Erik (E)
[Class] => BE1F, BE1B, BE1A
[StartDate] => 04/06/2013 08:30:00
[EndDate] => 04/06/2013 10:00:00
)
etc.
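Because I lack experience with scraping data, I would end up with an extremely inefficient and inflexible solution. Should I use an XML parser? Or regex? Any ideas on how to approach this?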
Answer 0 (score: 2)
The regex way:
<pre><?php
$html = file_get_contents('the_url.html');
$clean_pattern = <<<'LOD'
~
# definitions
(?(DEFINE)
(?<start> <!--\hSTART\hOBJECT-CELL\h--> )
(?<end> (?>[^<]++|<(?!!--))*<!--\hEND\hOBJECT-CELL\h--> )
(?<next_cell> (?>[^<]++|<(?!td\b))*<td[^>]*+> )
(?<cell_content> [^<]*+ )
)
# pattern
\g<start>
\g<next_cell> (?<Name> \g<cell_content> )
\g<next_cell> (?<Type> \g<cell_content> )
\g<next_cell> (?<Code> \g<cell_content> )
\g<next_cell> (?<Classroom> \g<cell_content> )
\g<next_cell>
\g<next_cell> (?<Teacher> \g<cell_content> )
\g<next_cell>
\g<next_cell> (?<Class> \g<cell_content> )
\g<end>
~x
LOD;
preg_match_all($clean_pattern, $html, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
echo <<<LOD
Name: {$match['Name']}
Type: {$match['Type']}
Code: {$match['Code']}
Classroom: {$match['Classroom']}
Teacher: {$match['Teacher']}
Class: {$match['Class']}<br/><br/>
LOD;
}
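The loop above only echoes the fields, while the question asks for an array. With PREG_SET_ORDER each match holds both numeric and named keys (and the names of the groups in the DEFINE block may appear as well), so one possible way to keep only the wanted fields is array_intersect_key. The following is a minimal sketch on a toy pattern and toy input, not on the real timetable markup; $pattern, $input and $lessons are hypothetical names, and the short array syntax assumes PHP 5.4 or later.

```php
<?php
// Toy pattern (an assumption for illustration): one helper group inside DEFINE
// and two named fields, mirroring the structure of the pattern above.
$pattern = '~(?(DEFINE)(?<word>\w+))(?<Name>\g<word>);(?<Type>\g<word>)~';
$input   = 'Financiering;WC';

// Only these keys should end up in the result.
$fields = ['Name', 'Type'];

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$lessons = [];
foreach ($matches as $match) {
    // Drop numeric keys and helper-group keys, keep the named fields only.
    $lessons[] = array_intersect_key($match, array_flip($fields));
}

print_r($lessons);
```

The same filter should work with the full pattern by listing all six field names in $fields.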
The DOM/XPath way:
$doc = new DOMDocument();
@$doc->loadHTMLFile('the_url.html');
$xpath = new DOMXPath($doc);
$elements = $xpath->query("//*[comment() = ' START OBJECT-CELL ']");
$fields = array('Name', 'Type', 'Code', 'Classroom', 'Teacher', 'Class');
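// indexes of the lines in each cell's text that are not needed
$not_needed = array(10,8,6,1,0);
// initialise the result so it is defined even when nothing matches
$result = array();
foreach ($elements as $element) {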
$temp = explode("\n", $element->nodeValue);
foreach ($not_needed as $val) { unset($temp[$val]); }
array_walk($temp, function (&$item){ $item = trim($item); });
$result[] = array_combine($fields, $temp);
}
print_r ($result);
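Neither approach above fills in StartDate and EndDate. Following the rule described in the question (date = start date plus column number, time = row position plus rowspan), a minimal sketch could look like the one below. Everything specific in it is an assumption rather than something read from the timetable page: the helper name lessonInterval, a first row at 08:30, 30-minute rows, a zero-based column where 0 is Monday, and a week starting on 03/06/2013. It also assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: turns a cell position into StartDate/EndDate strings.
// $firstSlot and $slotMinutes are assumed values, not taken from the page.
function lessonInterval(
    DateTimeImmutable $weekStart,
    int $column,
    int $row,
    int $rowspan,
    string $firstSlot = '08:30',
    int $slotMinutes = 30
): array {
    // Date: the week's start date plus the zero-based column (0 = Mo).
    $day = $weekStart->modify("+{$column} days");

    // Start time: the first slot of the day plus the row offset.
    [$hour, $minute] = array_map('intval', explode(':', $firstSlot));
    $start = $day->setTime($hour, $minute)
                 ->modify('+' . ($row * $slotMinutes) . ' minutes');

    // End time: the start plus one slot length per spanned row.
    $end = $start->modify('+' . ($rowspan * $slotMinutes) . ' minutes');

    return [
        'StartDate' => $start->format('d/m/Y H:i:s'),
        'EndDate'   => $end->format('d/m/Y H:i:s'),
    ];
}

// The leading "!" resets the time part to 00:00:00.
$weekStart = DateTimeImmutable::createFromFormat('!d/m/Y', '03/06/2013');
$interval  = lessonInterval($weekStart, 1, 0, 3);

print_r($interval);
```

Under these assumptions, column 1, row 0 and a rowspan of 3 give 04/06/2013 08:30:00 to 04/06/2013 10:00:00, the values shown in the question. Note that in an HTML table a cell with a rowspan pushes the cells of the following rows to the right, so the column index has to be worked out with already occupied cells taken into account.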