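Hi, I'm having some trouble storing scraped content in a MySQL database.
What I'm trying to do is save the module codes and module titles from this site [http://www.ucc.ie/modules/descriptions/page014.html] [1] into a MySQL database. I can fetch the content from the website, but I can't seem to save the scraped content to the database. I keep getting the error "Query was empty", even though the query is not empty.
I've already spent some time on this and can't work it out. Any help resolving it would be greatly appreciated.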
<?php
//Here is a simple web scraping example using the PHP DOM that tries to get the largest text body of a HTML document. I needed it for a spider
//that had to show a short description for a page. It assumes that document annotation can be the largest <div>, <td> or <p> element in the page.
//In the example I show a way to prevent a bug in the DOM as it sometimes just doesn't recognize html encoding. It seems to work if you put
//charset meta tag right after the head tag of the document.
$host="localhost";
$user="root";
$password="";
mysql_connect($host,$user,$password) or die("could not connect to the host");
mysql_select_db("plot_a_coursedb");
$ch= curl_init();
curl_setopt ($ch, CURLOPT_URL, 'http://www.ucc.ie/modules/descriptions/page014.html' );
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch,CURLOPT_VERBOSE,1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt ($ch, CURLOPT_REFERER, 'http://localhost:8080/extractsite/index2.html'); //just a fake referer
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_POST,0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 20);
$html= curl_exec($ch);
$html1= curl_getinfo($ch);
//try to get page encoding as it was sent from server
if ($html1['content_type']){
$arr= explode('charset=',$html1['content_type']);
$csethdr= strtolower(trim($arr[1]));
} else {
$csethdr= false;
}
$cset= false;
$arr= array();
//This has to replace page meta tags for charset with utf-8, but it doesn't actually help(see the bug info).
if (preg_match_all('/(<meta\s*http-equiv="Content-Type"\s*content="[^;]*;\s*charset=([^"]*?)(?:"|\;)[^>]*>)/',$html,$arr,PREG_PATTERN_ORDER)){
$cset= strtolower(trim($arr[2][0]));
if ($cset!='utf-8'||$cset!=$csethdr){
$new= str_replace($arr[2][0],'utf-8',$arr[1][0]);
$html= str_replace($arr[1][0],$new,$html);
$cset= $csethdr;
} else {
$cset= false;
}
if ($cset=='utf-8'){
$cset= false;
}
}
unset($arr);
if ($cset){
$html= iconv($cset,'utf-8',$html);
}
unset($cset);
//solve dom bug
$html=preg_replace('/<head[^>]*>/','<head><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">',$html);
@$dom= new DOMDocument();
@$dom->loadHTML($html);
@$dom->preserveWhiteSpace = false;
function getMaxTextBody($dom){
$content = $dom->getElementsByTagname('div');
$content2= $dom->getElementsByTagname('td');
$content3= $dom->getElementsByTagname('p');
$content4 = $dom->getElementsByTagname('B');
$new = array();
foreach ($content as $value) {
$new[]= $value;
unset($value);
}
unset($content);
foreach ($content2 as $value) {
$new[]= $value;
unset($value);
}
unset($content2);
foreach ($content3 as $value) {
$new[]= $value;
unset($value);
}
unset($content3);
foreach ($content4 as $value) {
$new[]= $value;
unset($value);
}
unset($content4);
$maxlen= 0;
$result= '';
foreach ($new as $item)
{
$str= $item->nodeValue;
if (strlen($str)>$maxlen){
$content1= $item->getElementsByTagName('div');
$content2= $item->getElementsByTagname('td');
$content3= $item->getElementsByTagname('p');
$content4 = $dom->getElementsByTagname('b');
$contentnew= array();
foreach ($content1 as $value) {
$contentnew[]= $value;
unset($value);
}
unset($content1);
foreach ($content2 as $value) {
$contentnew[]= $value;
unset($value);
}
unset($content2);
foreach ($content3 as $value) {
$contentnew[]= $value;
unset($value);
}
unset($content3);
foreach ($content4 as $value) {
$contentnew[]= $value;
unset($value);
}
unset($content4);
// Insert data into database query
$query = mysql_query("INSERT INTO data (div,td,p,b) VALUES ('$content1','$content2','$content3','$content4')");
mysql_query($query) or die (mysql_error());
// Close the database connection
mysql_close();
if (count($contentnew)==0){
$result= $str;
} else {
foreach ($contentnew as $value) {
$str1= getMaxTextBody($value);
$str2= $value->nodeValue;
//let's say largest body has more than 50% of the text in its parent
if (strlen($str1)*2<strlen($str2)){
$str1= $str2;
}
if (strlen($str1)*2>strlen($str)&&strlen($str1)>$maxlen){
$result= $str1;
} elseif (strlen($str1)>$maxlen){
$result= $str1;
}
$maxlen= strlen($result);
}
}
$maxlen= strlen($result);
unset($contentnew);
}
}
unset($new);
return $result;
}
print getMaxTextBody($dom);
?>
Below is the MySQL table I created to store the content:
DROP TABLE IF EXISTS `data`;
CREATE TABLE `data` (
`div` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL ,
`td` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL ,
`p` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL ,
`b` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL ,
PRIMARY KEY (`div`)
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_unicode_ci;
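Any help with why my content isn't being saved to the database would be greatly appreciated.
Answer 0 (score: 2)
At the point where the SQL query is built, your content variables really are empty: they were unset just above it.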
$query = mysql_query("INSERT INTO data (div,td,p,b) VALUES ('$content1','$content2','$content3','$content4')");
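I think you probably want to put the nodeValue of each $value in each $content list into that MySQL query as a string (you can access a node's text content with $value->nodeValue).
For example, if you want the text content of a node from your example (such as this P node), this is what the node looks like when you print_r it: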
DOMElement Object
(
[tagName] => p
[schemaTypeInfo] =>
[nodeName] => p
[nodeValue] => Students should note that all of the modules below may not
be available to them.
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[nextSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => p
[baseURI] =>
[textContent] => Students should note that all of the modules below may not
be available to them.
)
You can see that the node holds two values that may be useful to you: textContent and nodeValue.
You can access them from your code like this:
foreach ($content3 as $value) { // content3 contains the p nodes, I think?
// let's see what the node looks like
print_r($value);
// let's get hold of the text value from the node
$mytempvariable=$value->nodeValue;
print "CONTENT OF P NODE: \n\n$mytempvariable\n\n\n";
}
This will print the text of all the P nodes.
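As a self-contained illustration of the same idea, here is a minimal sketch that copies node text into plain string arrays before anything is unset. The helper name collectText and the sample markup are made up for this example and are not part of the original script; the real page's structure may differ.

```php
<?php
// Hypothetical helper (not from the original post): copy the text of every
// element with the given tag name into a plain array of strings.
function collectText(DOMDocument $dom, string $tagName): array
{
    $texts = [];
    foreach ($dom->getElementsByTagName($tagName) as $node) {
        // nodeValue is a string, unlike the DOMNodeList it came from
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Made-up sample markup standing in for the scraped page.
$sampleHtml = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>'
    . '<p><b>XX1001</b> First sample title</p>'
    . '<p><b>XX1002</b> Second sample title</p>'
    . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($sampleHtml);
libxml_clear_errors();

$codes      = collectText($dom, 'b'); // text of the <b> elements
$paragraphs = collectText($dom, 'p'); // full text of the <p> elements

print_r($codes);
print_r($paragraphs);
```

Because the arrays hold ordinary strings, they stay usable after the node lists themselves have been unset, and they can be passed to a query one row at a time.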
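Answer 1 (score: 0)
I think using memcache would make your process faster. Also, fetch only the divs you need, store them in memcache, and retrieve them by a unique page name when required.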
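Answer 2 (score: 0)
You aren't escaping the $contentX variables in your query. I suspect the query you are issuing is considerably more complicated than you think.
See http://php.net/manual/en/function.mysql-escape-string.php for details.
More broadly, you should use the PDO extension and prepared statements for your queries.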
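A minimal sketch of the prepared-statement approach follows, assuming each row has already been reduced to plain strings. The function name insertScrapedRows and the row layout are assumptions made for illustration; the table and column names come from the question. The column names are backtick-quoted because DIV is a reserved word in MySQL.

```php
<?php
// Hypothetical helper (not from the original post): insert one row per
// scraped record using a single prepared statement.
function insertScrapedRows(PDO $pdo, array $rows): int
{
    // Placeholders keep the scraped text out of the SQL string entirely,
    // so quotes or markup inside the text cannot change the query.
    $stmt = $pdo->prepare(
        'INSERT INTO `data` (`div`, `td`, `p`, `b`) VALUES (:div, :td, :p, :b)'
    );

    $inserted = 0;
    foreach ($rows as $row) {
        $stmt->execute([
            ':div' => $row['div'],
            ':td'  => $row['td'],
            ':p'   => $row['p'],
            ':b'   => $row['b'],
        ]);
        $inserted += $stmt->rowCount();
    }
    return $inserted;
}

// Example usage, assuming the connection settings from the question's script:
// $pdo = new PDO('mysql:host=localhost;dbname=plot_a_coursedb;charset=utf8', 'root', '');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// insertScrapedRows($pdo, [
//     ['div' => 'XX1001', 'td' => 'cell text', 'p' => 'paragraph text', 'b' => 'bold text'],
// ]);
```

With this shape the statement is executed exactly once per row. The question's script instead passes the return value of the first mysql_query() call into a second mysql_query() call; if the first call fails and returns false, the second call receives an empty query string, which is a likely source of the "Query was empty" message.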
Answer 3 (score: 0)
The reason $content1 through $content4 are not inserted is that you unset them after the foreach loops, so no values are inserted.