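I am building a small parser for news articles from one specific news site. I want the code to be ready for adding other news sites later, which is why it is structured the way it is.
I want the page to reload its content without a full page refresh. I know that retrieving the content from the chosen URL takes a while, so I would like to add a progress bar from jQuery UI (I know that is a lot to ask). The progress bar is optional.
I am also using the Simple HTML DOM parser.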
<?php
//Page load time: microtime(true) returns the current time as a float in seconds
$starttime = microtime(true);
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>svd parser</title>
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon"/>
<link rel="stylesheet" type="text/css" href="style.css"/>
<script type="text/javascript" src="jquery-1.10.2.min.js"></script>
</head>
<body>
<div class="container">
<div id="head">
<h1>svd parser</h1>
<hr>
<form action="index.php" method="post">
<input type="text" name="s" placeholder="enter a URL to start the svd parser" style="width: 495px;">
<input type="submit" value="svd parser it">
</form>
<?php
if (isset($_POST["s"]) && trim($_POST["s"]) !="") {
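//Extract the host name from the submitted URL (http or https)
preg_match('@^(?:https?://)?([^/]+)@i', trim($_POST["s"]), $matches);
$host = $matches[1];
//Keep only the last two segments of the host name, e.g. "svd.se"
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
//Escape the value because it comes straight from user input
echo "<b>domain name is: " . htmlspecialchars($matches[0]) . ".</b><br>\n";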
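//Returns the selectors for a supported domain, or null if the domain is unknown.
//Support for other news sites can be added as further branches here.
function checkDomainGetRightValues($domain) {
if ($domain == "svd.se") {
$h1 = "h1";
$page = "p[class=preamble], div[class=articletext]";
return array('h1' => $h1, 'searchparse' => $page);
} else {
return null;
}
}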
include('simple_html_dom.php');
$ids = checkDomainGetRightValues($matches[0]);
//Fetch the page only when the domain is supported
$html = ($ids !== null) ? file_get_html($_POST['s']) : false;
if (!$html) {
echo "This domain is not supported yet, or the page could not be loaded.<br>";
} else {
//Find all h1 elements
$ret = $html->find($ids['h1']);
//Strip the h1 of all HTML tags (such as links) and wrap it in h1 tags again
if (isset($ret[0])) {
echo "<h1>" . strip_tags($ret[0]) . "</h1>";
}
//Find the actual article and ignore everything else
$ret = $html->find($ids['searchparse']);
//Print the preamble in bold so the reader gets a hint of what the article is about
if (isset($ret[0])) {
echo "<p><b>" . strip_tags($ret[0]) . "</b></p>";
}
//Print the actual article without HTML tags, keeping <p> so it stays readable
if (isset($ret[1])) {
$a = html_entity_decode($ret[1]);
echo strip_tags($a, '<p>');
}
//Free the memory held by the parser
$html->clear();
unset($html);
}
}else{
echo "You need to enter the full article URL<br>";
}
//Page load time: elapsed seconds since $starttime was taken
$totaltime = microtime(true) - $starttime;
printf('Page loaded in %.3f seconds.', $totaltime);
?>
</div>
<div id="sidebar">
<b>SvD</b>
</div>
</div>
</body>
</html>
If anyone could at least point me in the right direction, I would really appreciate it!
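
To make the goal concrete, here is a minimal sketch in JavaScript of the no-refresh flow I have in mind. It is only an illustration under several assumptions: a hypothetical `parse.php` endpoint that returns just the article HTML rather than the whole page, hypothetical `#result` and `#progressbar` elements, and jQuery UI loaded next to jQuery (it is not in the page yet). `loadArticle` and its `deps` object are names made up for this sketch. Since the time the server needs is not known in advance, the bar is set to indeterminate (`false`) while waiting and to 100 when the request finishes.

```javascript
// Hypothetical helper: sends the URL to the parser endpoint and hands the
// returned HTML fragment to a render callback, without reloading the page.
// The network call and the DOM updates are injected through `deps` so the
// core logic does not depend on a browser.
function loadArticle(url, deps) {
  var trimmed = String(url || "").trim();
  if (trimmed === "") {
    deps.render("You need to enter the full article URL");
    return false; // nothing was requested
  }
  deps.setProgress(false); // indeterminate while the server is working
  deps.post("parse.php", { s: trimmed }, function (html) {
    deps.render(html);
    deps.setProgress(100);
  }, function () {
    deps.render("Could not load the article.");
    deps.setProgress(100);
  });
  return true; // a request was started
}

// Browser wiring, assuming jQuery and jQuery UI are loaded and the
// hypothetical #result and #progressbar elements exist in the page.
if (typeof window !== "undefined" && window.jQuery) {
  window.jQuery(function ($) {
    $("#progressbar").progressbar({ value: 0 });
    $("form").on("submit", function (event) {
      event.preventDefault(); // stop the normal full-page form post
      loadArticle($("input[name=s]").val(), {
        post: function (endpoint, data, ok, fail) {
          $.post(endpoint, data).done(ok).fail(fail);
        },
        render: function (html) { $("#result").html(html); },
        setProgress: function (value) {
          $("#progressbar").progressbar("value", value);
        }
      });
    });
  });
}
```

With this split, the PHP that fetches and prints the article would move into the assumed `parse.php`, while `index.php` would keep only the form and the empty containers.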