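I have a problem. I created a page to scrape some data from a public entity's site. Their terms of use do not prohibit this, and it is public data anyway. I know the page I wrote for this is quick and dirty, but I can't figure out why it keeps looping. The template page I created to run the actual scrape code keeps running continuously: it starts over again and again. Here is the code: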
<?php
/*
Template Name: Scraping template
*/
// Which scrape to run, taken from the ?scrape= query parameter.
$strFile = isset($_GET['scrape']) ? $_GET['scrape'] : '';
$intNumOfRec = 0;
$intNumOfErr = 0;
$intHeaderLine = '';

// Append one line to the scraping log file.
function fnLogger ($strLine) {
    $hdlLogFile = fopen("ScrapingLogFile","a") or die("Unable to open file!");
    fwrite($hdlLogFile,$strLine."\r\n");
    fclose($hdlLogFile);
    return;
}

// Fetch up to $intRecChunk rows that have no authority yet, scrape each one, and update it.
function fnProcessMcr() {
    global $wpdb,$intNumOfRec,$intNumOfErr;
    $intRecChunk = 50;
    $strQuery = 'SELECT * FROM frg_subdivision_index WHERE authority is null limit '.$intRecChunk.';';
    $objQuery = $wpdb->get_results($strQuery);
    echo $strQuery.'<br>';
    fnLogger($strQuery);
    foreach($objQuery as $index=>$row)
    {
        fnLogger($row->id.' ');
        if(strlen($row->book) !== 0 && strlen($row->map) !== 0 && strlen($row->begin) !== 0)
        {
            $url = '[url withheld]?q='.$row->book.'-'.$row->map.'-'.$row->begin;
            $ch = curl_init();
            $timeout = 5;
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
            curl_setopt($ch, CURLOPT_ENCODING, "gzip");
            $data = curl_exec($ch);
            curl_close($ch);
            $intCheckIndex = stripos($data,'<td class="right aligned"><h3 class="ui huge basic header">');
            // stripos() returns false when the marker is missing; compare strictly so a match at offset 0 is not mistaken for a miss.
            if($intCheckIndex === false)
            {
                //echo $row->id.': Could not find type prefix.';
                $strType = 'Unknown';
                $strJurisdiction = 'Unknown';
                $intNumOfErr++;
            }
            else
            {
                // 59 is the length of the header marker searched for above.
                $data = substr($data,$intCheckIndex+59);
                $intCheckIndex = stripos($data,'<strong>[CANCELLED]</strong>');
                $strType = "";
                if($intCheckIndex !== false) {
                    // 28 is the length of the [CANCELLED] marker.
                    $data = substr($data,$intCheckIndex+28);
                    $strType = 'Cancelled ';
                }
                $strType .= trim(substr($data,0,stripos($data,'Parcel')-1));
                // 23 is the length of 'Local Jurisdiction</td>'; the value sits in the next <td>.
                $data = substr($data,stripos($data,'Local Jurisdiction</td>')+23);
                $data = substr($data,stripos($data,'<td>')+4);
                $strJurisdiction = ucwords(strtolower(trim(substr($data,0,stripos($data,'<')))));
                //echo ($index+$intBegRec).': Type is: '.$strType.' in '.$strJurisdiction;
                $intNumOfRec++;
            }
        }
        else
        {
            echo $row->id.': Missing book, map or begin.<br>';
            $strType = 'Unknown';
            $strJurisdiction = 'Unknown';
            $intNumOfErr++;
        }
        // Write the result back so this row no longer matches "authority is null".
        $strUpdateResults = $wpdb->update('frg_subdivision_index',array(
            'type' => $strType,
            'authority' => $strJurisdiction),
            array(
            'id' => $row->id));
        echo $row->id.': Type: '.$strType.' Authority: '.$strJurisdiction;
        if($strUpdateResults === false)
        {
            echo ': ERROR update database.<br>';
            $intNumOfErr++;
        }
        else
        {
            echo '<br>';
        }
    }
    echo "<br><br>Number of records updated was: ".$intNumOfRec.'<br>';
    echo "Number of errors was: ".$intNumOfErr.'<br>';
    return;
}

// fnProcessMcrUnknown() and fnChangeTo404() are not shown in this post.
switch ($strFile) {
    case 'mcr':
        fnLogger('Entered Switch Case mcr');
        fnProcessMcr();
        break;
    case 'mcrunknown':
        fnProcessMcrUnknown();
        break;
    default:
        fnChangeTo404();
}
?>
Here is the output from the log file so you can see what it is doing:
Entered Switch Case mcr
SELECT * FROM frg_subdivision_index WHERE authority is null limit 50;
30729
30730
30731
30732
30733
30734
30735
30736
30737
30738
30739
30740
30741
30742
30743
30744
30745
30746
30747
30748
30749
30750
30751
30752
30753
30754
30755
30756
30757
30758
30759
30760
30761
30762
30763
30764
30765
30766
30767
30768
Entered Switch Case mcr
SELECT * FROM frg_subdivision_index WHERE authority is null limit 50;
30768
30769
30769
30770
30770
30771
30771
30772
30772
30773
30773
30774
30774
30775
30775
30776
30776
30777
30777
30778
30778
30779
30780
30781
30782
30783
30784
30785
30786
30787
30788
30789
30790
30791
30792
30793
30794
30795
30796
30797
30798
30799
30800
30801
30802
30803
30804
30805
30806
30807
30808
Entered Switch Case mcr
SELECT * FROM frg_subdivision_index WHERE authority is null limit 50;
30808
30809
30809
30810
30810
30811
30811
30812
30812
30813
30813
30814
30814
30815
30815
30816
30816
30817
30817
30818
30819
30820
30821
30822
30823
30824
30825
30826
30827
30828
30829
30830
Entered Switch Case mcr
SELECT * FROM frg_subdivision_index WHERE authority is null limit 50;
30830
30831
30831
30832
30832
30833
30833
30834
30834
30835
30835
30836
30836
30837
30837
30838
30838
30839
30839
30840
30840
30841
30841
Entered Switch Case mcr
SELECT * FROM frg_subdivision_index WHERE authority is null limit 50;
30841
30842
Does anyone know why it keeps looping back?
Answer 0 (score: 0)
OK, there was nothing wrong with the code. It was running into a timeout in WordPress.
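For anyone hitting the same symptom, here is a minimal sketch of one way to stay under a request timeout. It assumes the restarts come from the request being cut off and retried; the answer above does not say which timeout was involved. The idea is to stop each run after a fixed time budget and to skip a run if another one is still active. The repeated IDs in the log, such as 30769 appearing twice, are at least consistent with overlapping runs. The names `fnProcessWithinBudget` and `fnAcquireRunLock`, the lock file path, and the 20-second budget are all hypothetical and not part of the original template.

```php
<?php
// Hypothetical helper (not from the original post): handle rows until a
// time budget is spent, so each request can finish before a server timeout.
function fnProcessWithinBudget(array $arrRows, callable $fnHandleRow, float $fltBudgetSeconds): int
{
    $fltStart = microtime(true);
    $intDone = 0;
    foreach ($arrRows as $row) {
        // Do not start another row once the budget has been used up.
        if (microtime(true) - $fltStart >= $fltBudgetSeconds) {
            break;
        }
        $fnHandleRow($row);
        $intDone++;
    }
    return $intDone;
}

// Hypothetical guard: returns an open lock handle, or null if the lock file
// cannot be opened or another run already holds the lock.
function fnAcquireRunLock(string $strLockFile)
{
    $hdlLock = fopen($strLockFile, 'c');
    if ($hdlLock === false) {
        return null;
    }
    // LOCK_NB makes flock() return immediately instead of waiting.
    if (!flock($hdlLock, LOCK_EX | LOCK_NB)) {
        fclose($hdlLock);
        return null;
    }
    return $hdlLock;
}

// Example run with stand-in IDs; a real handler would scrape and update one record.
$hdlLock = fnAcquireRunLock(sys_get_temp_dir() . '/scrape-demo.lock');
if ($hdlLock === null) {
    echo "Another run is still in progress; skipping.\n";
} else {
    $intDone = fnProcessWithinBudget([30729, 30730, 30731], function ($intId) {
        echo "Handling {$intId}\n";
    }, 20.0);
    echo "Handled {$intDone} rows in this request.\n";
    flock($hdlLock, LOCK_UN);
    fclose($hdlLock);
}
```

In the template, the callback would wrap the per-row fetch and the `$wpdb->update()` call, and the page would be requested again until no rows with a null `authority` remain. Setting `CURLOPT_TIMEOUT` alongside `CURLOPT_CONNECTTIMEOUT` would also bound how long a single slow response can hold up a run.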