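Currently I can scrape content from my desired website without any trouble, but if you look at my demo, you can see that the `content` entry in my array only shows the source line. No matter what I change, it doesn't get fixed.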
$page = (isset($_GET['p']) && $_GET['p'] != 0) ? (int) $_GET['p'] : '';
$html = file_get_html('http://screenrant.com/movie-news/' . $page);
foreach ($html->find('#site-top ul h2 a') as $element)
{
    print '<br><br>';
    echo $url = '' . $element->href;
    $html2 = file_get_html($url);
    print '<br><br>';
    $image = $html2->find('meta[property=og:image]', 0);
    print $news['image'] = $image->content;
    print '<br><br>';
    // Ending The Featured Image
    $title = $html2->find(' header > h1', 0);
    print $news['title'] = $title->plaintext;
    print '<br>';
    // Ending the titles
    print '<br>';
    $articles = $html2->find('div.top-content > article > p');
    foreach ($articles as $article) {
        echo "$article->plaintext<p>";
    }
    $news['content'] = $article->plaintext;
    print '<br><br>';
    // Selector: #post > div:nth-child(2) > header > p > time
    $date = $html2->find('header > p > time', 0);
    $news['date'] = $date->plaintext;
    $dexp = explode(', ', $date);
    print $date = $dexp[0] . ', ' . $dexp[1];
    print '<br><br>';
    $genre = "news";
    print '<br>';
    mysqli_query($DB, "INSERT INTO `wp_scraped_news` SET
        `hash` = '" . $news['title'] . "',
        `title` = '" . $news['title'] . "',
        `image` = '" . $news['image'] . "',
        `content` = '" . $news['content'] . "'");
    print '<pre>'; print_r($news); print '</pre>';
}
I'm currently using simple_html_dom.php to do the scraping.
Answer 0 (score: 1)
If you take a look at this code:
$articles = $html2->find('div.top-content > article > p');
foreach ($articles as $article) {
    echo "$article->plaintext<p>";
    // This prints the article content paragraph by paragraph
}
$news['content'] = $article->plaintext;
// This grabs only the last paragraph of the article content, i.e. the source,
// because the assignment sits outside the foreach and $article still holds the last <p>.
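The behaviour is easy to reproduce without any scraping. The following is a minimal sketch, assuming a plain array of strings (`$paragraphs`, a hypothetical stand-in for the `<p>` elements that `find()` returns): in PHP the `foreach` loop variable stays in scope after the loop and keeps the final element.

```php
<?php
// Hypothetical stand-in for the paragraph texts of one article.
$paragraphs = ['First paragraph.', 'Second paragraph.', 'Source: Example'];

foreach ($paragraphs as $paragraph) {
    // Each paragraph is visible here, one per iteration.
}

// After the loop, $paragraph still holds the last element only.
$content = $paragraph;
echo $content, "\n";
```

Assigning after the loop therefore stores just the final paragraph, which on this page happens to be the source line.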
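What you actually need to do is this:
$articles = $html2->find('div.top-content > article > p');
$news['content'] = ''; // reset for each article so content does not carry over
foreach ($articles as $article) {
    echo "$article->plaintext<p>";
    // Append every paragraph inside the loop
    $news['content'] .= $article->plaintext . "<p>";
}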
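As an alternative that is not part of the answer above, the paragraphs can be collected into an array and joined once at the end with `implode()`. This is only a sketch under stated assumptions: `$paragraphs` is a hypothetical array of strings standing in for the `$article->plaintext` values, and wrapping each paragraph in `<p>...</p>` is an illustrative choice.

```php
<?php
// Hypothetical paragraph texts standing in for $article->plaintext values.
$paragraphs = ['First paragraph.', 'Second paragraph.', 'Source: Example'];

$parts = [];
foreach ($paragraphs as $text) {
    // Collect every paragraph instead of overwriting a single variable.
    $parts[] = $text;
}

// Join once after the loop, wrapping each paragraph in <p>...</p>.
$news = ['content' => '<p>' . implode('</p><p>', $parts) . '</p>'];
echo $news['content'], "\n";
```

Building the array first also makes it straightforward to drop unwanted entries, such as the trailing source line, before joining.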