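I have written a script in PHP to scrape titles and their links from a web page and write them to a CSV file. Because the site is paginated, only the content of the last page stays in the CSV file; everything else gets overwritten. That happens when I open the file in write mode (w). When I do the same thing in append mode (a), all the data ends up in the file.
Opening the file for appending or writing on every page means the CSV file is opened and closed many times (probably because I have applied the loop wrongly), which makes the script less efficient and more time consuming.
How can I do this efficiently, and of course using write (w) mode?
This is what I have written so far: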
<?php
include "simple_html_dom.php";
$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    $infile = fopen("itemfile.csv","a");
    foreach($dom->find('.question-summary') as $file){
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile,[$itemTitle,$itemLink]);
    }
    fclose($infile);
}

for($i = 1; $i<10; $i++){
    get_content($link.$i);
}
?>
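Answer 0 (score: 2)
If you don't want to open and close the file multiple times, open it before the for loop, pass the handle to the function, and close it after the loop: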
<?php
include "simple_html_dom.php";
$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

// The already-open file handle is passed in, so the function never opens or closes the file.
function get_content($url, $infile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    foreach($dom->find('.question-summary') as $file){
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
        fputcsv($infile,[$itemTitle,$itemLink]);
    }
}

// Open once in write mode, before the loop.
$infile = fopen("itemfile.csv","w");
for($i = 1; $i<10; $i++) {
    get_content($link.$i, $infile);
}
// Close once, after all pages have been written.
fclose($infile);
?>
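To see why write mode kept only the last page in the question, here is a minimal sketch that needs no network access. The temp files and the three made-up "pages" are assumptions for illustration only; the point is that fopen() with "w" truncates the file on every call, so reopening it per page discards the earlier pages, while a single handle keeps them all.

```php
<?php
// Hypothetical temp files, used only for this demonstration.
$reopened = tempnam(sys_get_temp_dir(), 'csv');
$single   = tempnam(sys_get_temp_dir(), 'csv');

// Made-up rows standing in for three scraped pages.
$pages = [
    [['Title 1', '/questions/1']],
    [['Title 2', '/questions/2']],
    [['Title 3', '/questions/3']],
];

// Pattern from the question, but with "w": the file is reopened for every page.
foreach ($pages as $rows) {
    $fh = fopen($reopened, 'w'); // "w" truncates the file each time it is opened
    foreach ($rows as $row) {
        fputcsv($fh, $row, ',', '"', '\\');
    }
    fclose($fh);
}

// Pattern from this answer: open once, write every page, close once.
$fh = fopen($single, 'w');
foreach ($pages as $rows) {
    foreach ($rows as $row) {
        fputcsv($fh, $row, ',', '"', '\\');
    }
}
fclose($fh);

// Count the lines that survived in each file.
$reopenedLines = count(file($reopened, FILE_IGNORE_NEW_LINES));
$singleLines   = count(file($single, FILE_IGNORE_NEW_LINES));
echo "reopened per page: {$reopenedLines} line(s), opened once: {$singleLines} line(s)\n";

unlink($reopened);
unlink($single);
```

With the rows above, the reopened file should end up holding only the last page's row, while the file opened once holds all three.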
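Answer 1 (score: 1)
I would consider not echoing or writing results to the file inside the get_content function. I would rewrite it so that it only gets content, which lets me handle the extracted data any way I like. Something like this (please read the code comments):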
<?php
include "simple_html_dom.php";
$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=";

// This function does not write data to a file or print it. It only extracts data
// and returns it as an array.
function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);

    $dom = new simple_html_dom();
    $dom->load($htmlContent);

    // We don't need the following line anymore
    // $infile = fopen("itemfile.csv","a");

    // We will collect extracted data in an array
    $result = [];
    foreach($dom->find('.question-summary') as $file){
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        $result []= [$itemTitle, $itemLink];
        // echo "{$itemTitle},{$itemLink}<br>";
        // No need to write to file, so we don't need the following as well
        // fputcsv($infile,[$itemTitle,$itemLink]);
    }

    // No file was opened, so the following line is no longer required
    // fclose($infile);

    // Return extracted data from this specific URL
    return $result;
}

// Merge all results (the result for each URL with a different page parameter).
// With a little refactoring, get_content() can handle this as well
$result = [];
for($page = 1; $page < 10; $page++){
    $result = array_merge($result, get_content($link.$page));
}

// Now do whatever you want with $result: write its values to a file, print them, etc.
// You might want to write a function for this.
// The file is opened only once here, so write mode ("w") keeps every page.
$outputFile = fopen("itemfile.csv","w");
foreach ($result as $row) {
    fputcsv($outputFile, $row);
}
fclose($outputFile);
?>
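The "write a function for this" idea from the last comment could look roughly like the sketch below. The name write_rows_to_csv, the temp file and the demo rows are all hypothetical and not part of the code above; the function just opens the file once in "w" mode, writes every row, and reports how many rows it wrote.

```php
<?php
// Hypothetical helper: writes all rows to $path in a single pass
// and returns the number of rows written.
function write_rows_to_csv(string $path, array $rows): int
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path} for writing");
    }

    $written = 0;
    try {
        foreach ($rows as $row) {
            if (fputcsv($handle, $row, ',', '"', '\\') !== false) {
                $written++;
            }
        }
    } finally {
        // Close the handle even if writing a row throws.
        fclose($handle);
    }

    return $written;
}

// Made-up rows standing in for the merged $result array.
$demoRows = [
    ['How to scrape a table?', '/questions/1'],
    ['Parsing "quoted" titles, with commas', '/questions/2'],
];
$demoPath = tempnam(sys_get_temp_dir(), 'csv');
$count = write_rows_to_csv($demoPath, $demoRows);

// Read the file back to check that the rows survive the round trip.
$readBack = [];
$in = fopen($demoPath, 'r');
while (($row = fgetcsv($in, 0, ',', '"', '\\')) !== false) {
    $readBack[] = $row;
}
fclose($in);
unlink($demoPath);

echo "rows written: {$count}\n";
```

In the script above, the final fopen/foreach/fclose block would then shrink to a single call such as write_rows_to_csv("itemfile.csv", $result).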