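I'm having some trouble with a fairly simple scraping program. I hope the working parts help someone who wants to build a news aggregator. I like getting specific things, so I find that grabbing content and filtering it by keyword helps me avoid headlines about things that have nothing to do with my life.
I'd really appreciate any help or advice. This problem has been bugging me for a few days and I haven't found a good solution.
Basically, the program pulls a set of items from one table to move them into another table. Along the way it is supposed to take the URL of the item from table A that is currently being processed and call another program with an include. Program B uses that URL to fetch the content of the entry and save it to a file, which is named after other attributes of the current item in table A.
The file is then saved to a folder, and the dom variable in program B is cleared so that the loop can continue.
Here is program A: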
<?php
$db = 'newsfeed';
$zeta = 0;
$beta = 0;
// connect to RDS instance MySQL Database Newsfeed
include_once('/var/www/dbfunctions/mysqli_connectdb.php');
// set content source table
$sourcetable = 'feedsources';
$mastertable = 'mastertable';
// set date to remove results older than
date_default_timezone_set("UTC");
$datenow = date_timestamp_get(date_create());
$offset = "86400";
$deldate = $datenow - $offset;
//begin cycling through content data
//delete all "old" entries from the mastertable
//get number of source items present
$itemquery = "SELECT id,name FROM $sourcetable";
$itemresult = mysqli_query($conn, $itemquery);
while ($row = mysqli_fetch_assoc($itemresult)) {
$sourceid = $row['id'];
$sourcename = $row['name'];
// cycle through the data tables
$dataquery = "SELECT * FROM $sourcetable WHERE id = $sourceid;";
$dataresult = mysqli_query($conn, $dataquery);
while ($row = mysqli_fetch_assoc($dataresult)) {
$table = $row['datatable'];
}
// copy all data from the targeted table into the master table
//loop through the targeted table and copy to mysql
$getdata = "SELECT * FROM ".$table.";";
$datareturn = mysqli_query($conn, $getdata);
while ($row = mysqli_fetch_assoc($datareturn)) {
$date = $row['datecreated'];
$title = addslashes($row['title']);
$url = addslashes($row['url']);
$tags = addslashes($row['tags']);
$titleid = $row['id'];
//get content and place in html file in /var/www/html/nuzr/content/
//check whether the item already exists in the table
$checkquery = "select id from ".$mastertable." where title = '".$title."';";
$checkcheck = mysqli_query($conn, $checkquery);
if(mysqli_num_rows($checkcheck) > 0){
}else{
require_once("getcontent.php");
$copy = "INSERT INTO ".$mastertable." VALUES ('NULL','$table','$sourcename','$date','$title','$url','$tags','$filename');";
mysqli_query($conn, $copy);
echo "Beta is ".$beta;
$beta = $beta + 1;
}
}
// clean the master table
$delquery = 'DELETE FROM '.$mastertable.' WHERE datecreated < '.$deldate.';';
mysqli_query($conn, $delquery);
}
function clear()
{
$this->dom = null;
$this->parent = null;
$this->parent = null;
$this->children = null;
}
?>
Program B:
<?php
//Check Start
//echo "Program Starts";
// Include the library
include('/var/www/tools/dom/simple_html_dom.php');
$source = $url;
$content = array();
$header1 = array();
$header2 = array();
$i = 0; $y = 0;
// Retrieve the DOM from a given URL
$html = file_get_html($source);
//grab headers in case initial title is a header
foreach($html->find('h1') as $e){
$header1[$i] = $e->outertext;
//echo $e->outertext;
$i = $i + 1;
}
$i = 0;
foreach($html->find('h2') as $e){
$header2[$i] = $e->outertext;
//echo $e->outertext;
$i = $i + 1;
}
//reset counter
$i = 0;
// Find all paragraph tags and print their content into a text file
foreach($html->find('p') as $e){
$content[$i] = $e->outertext;
//echo $e->outertext;
$i = $i + 1;
}
//create the content storage file
$filename = "/var/www/html/nuzr/content/".$table.$titleid.".html";
echo "The filename is".$filename;
$file = fopen($filename,"a");
// write header and link to original article
$titleblurb = "<b>Original article courtesy of <a href='".$url."'>".$sourcename."</a></b>";
fwrite($file, $titleblurb);
// set site specific parameters based on header / footer size
if($sourcename == "The Globe and Mail"){
//Set indexing parameters
$z = $i - 13; $y = 2;
//Add Header content
$text = $header1[0];
fwrite($file, $text);
$text = $header2[1];
fwrite($file, $text);
}elseif($sourcename == "CNN Money"){
//Set indexing parameters
$z = $i - 3; $y = 1;
//Add header content
$text = $header1[0];
fwrite($file, $text);
$text = $header2[1];
fwrite($file, $text);
}elseif($sourcename == "CNN Markets"){
//Set indexing parameters
$z = $i - 3; $y = 1;
//Add header content
$text = $header1[0];
fwrite($file, $text);
//$text = $header2[1];
//fwrite($file, $text);
}elseif($sourcename == "BBC Business"){
//Set indexing parameters
$z = $i - 9; $y = 1;
//Add header content
$text = $header1[0];
fwrite($file, $text);
//$text = $header2[1];
//fwrite($file, $text);
}elseif($sourcename == "BBC Politics"){
//Set indexing parameters
$z = $i - 0; $y = 1;
//Add header content
$text = $header1[0];
fwrite($file, $text);
//$text = $header2[1];
//fwrite($file, $text);
}else{
echo $sourcename;
}
do{
$text = $content[$y];
fwrite($file, $text);
$y = $y +1;
}while($y<$z);
echo "Zeta is".$zeta;
$zeta = $zeta +1;
//close the content file
fclose($file);
//echo "File end.";
$html->clear();
unset($html);
?>
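When I run it with all of these echo statements as progress markers, it looks like the include (program B) runs only on the first iteration and then stops running. I was getting problems with file_get_html() until I added the clear.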
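A likely cause of the "runs only on the first iteration" symptom is the `require_once` call in program A: PHP executes a file named in `require_once` the first time it is reached and skips it on every later call, whereas `include` executes it each time. The minimal sketch below shows the difference in a loop. The temporary file is a hypothetical stand-in for getcontent.php and the `$runs` counter is only there for illustration.

```php
<?php
// Hypothetical stand-in for getcontent.php: each execution bumps a counter.
$script = tempnam(sys_get_temp_dir(), 'inc');
file_put_contents($script, '<?php $runs++;');

$runs = 0;
for ($i = 0; $i < 3; $i++) {
    require_once $script; // executed on the first pass only
}
$runsWithRequireOnce = $runs;

$runs = 0;
for ($i = 0; $i < 3; $i++) {
    include $script; // executed on every pass
}
$runsWithInclude = $runs;

unlink($script);
echo "require_once: $runsWithRequireOnce, include: $runsWithInclude\n";
// prints "require_once: 1, include: 3"
```

Switching to a plain `include` would make program B run for every row, but it would also re-include simple_html_dom.php each time, so wrapping program B in a function is probably the cleaner route.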
Answer 0 (score: 0)
You need to put the code from PROGRAM B inside the loop:
while ($row = mysqli_fetch_assoc($datareturn)) {
fetchURL($url);
}
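That way, for each entry, as soon as you have $url you can call PROGRAM B. I suggest turning PROGRAM B into a function and calling that function inside the while loop.
Something like this: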
function fetchURL($url) {
// Place PROGRAM B here. You can make the `include` be part of the program A at the top.
}
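As a rough sketch of how that function might be shaped: variables inside a PHP function are local, so the values program B reads from program A (`$table`, `$titleid`) have to be passed in, and `$filename` has to be returned for the INSERT statement. To keep the example runnable without a network connection or simple_html_dom, the hypothetical `fetchContent()` below takes an HTML string and pulls out paragraphs with a regular expression. In the real program it would call `file_get_html($url)` instead. The function name, its parameters, and the sample rows are all assumptions made for illustration.

```php
<?php
// Sketch only: fetchContent(), its parameters, and the sample rows are
// hypothetical stand-ins for program B and the mysqli result set.
function fetchContent(string $html, string $table, int $titleid, string $dir): string
{
    // Collect every <p>...</p> block; program B does this with simple_html_dom.
    preg_match_all('/<p\b[^>]*>.*?<\/p>/is', $html, $matches);

    // Same naming scheme as program B: table name followed by the row id.
    $filename = $dir . '/' . $table . $titleid . '.html';
    file_put_contents($filename, implode("\n", $matches[0]));

    // Return the path so the caller can store it in the master table.
    return $filename;
}

$dir = sys_get_temp_dir();
$rows = [
    ['id' => 1, 'html' => '<h1>One</h1><p>first</p>'],
    ['id' => 2, 'html' => '<h1>Two</h1><p>second</p><p>third</p>'],
];

$written = [];
foreach ($rows as $row) {
    // A function call runs on every iteration, unlike require_once.
    $written[] = fetchContent($row['html'], 'demo', $row['id'], $dir);
}
```

In program A the call would then replace the `require_once` line, along the lines of `$filename = fetchContent(...);`, with the library include moved to the top of the file.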