preg_replace in Simple HTML DOM

Time: 2013-06-24 18:39:47

Tags: preg-replace simple-html-dom

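So I am trying to fetch the latest news from a website and include it on my own site. The site uses Joomla (ugh), and as a result the hrefs in the content are missing the base href. A link will hold contensite.php?blablabla, which on my site results in the link http://www.mysite.com/contensite.php?blablabla

So I figured I would replace 'http://' with 'http://www.basehref.com' before echoing the content. But my knowledge stops here. What should I use? preg_replace, str_replace? I am not sure.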

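2 answers:

Answer 0: (score: 0)

So I could not fix the broken links (because of my lack of knowledge of preg matching). Instead I replace them with another link, and replace the class on the links with my fancybox class, so that the source site opens in a fancybox.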

include_once('db_connect.php');
// connect to my db

include_once('dom.php');
// include html_simple_dom!

$dom = file_get_html('http://www.remotesite.com');
// get the html content of a site and pass it through html simple dom !

$elem = $dom->find('div[class=blog]', 0);
// set the div to target for !



$pattern = '/(?<=href=")[^"]+(?=")/';
$replacement = 'http://www.remotesite.com';
$replacedHrefHtml = preg_replace($pattern, $replacement, (string) $elem);
// replacement 1
// replace the value of every href with the remote site's URL
// (the base href is missing, Joomla sucks, period!)
// I'm too lazy to preg_match it any other way, feel free to improve this!

$pattern2 = '/contentpagetitle/';
$replacement2 = 'fancybox fancybox.iframe';
$replacedHrefHtml2 = preg_replace($pattern2, $replacement2, $replacedHrefHtml);
// replacement 2
// replace the Joomla class contentpagetitle on the links with my fancybox class! fancy innit!


$pattern3 = '/readon/';
$replacement3 = 'fancybox fancybox.iframe';
$replacedHrefHtml2 = preg_replace($pattern3, $replacement3, $replacedHrefHtml2);
// replacement 3
// replace the Joomla class readon on the links with my fancybox class! fancy innit!
// (this works on the result of replacement 2, so that replacement 2 is not thrown away)

$replacedHrefHtml3 = preg_replace('/<img[^>]+>/i', '<br />(Plaatje)<br /><br /> ', $replacedHrefHtml2);
// finally swap each image in the string for a "(Plaatje)" placeholder!


$replacedHrefHtml4 = base64_encode($replacedHrefHtml3);
// encode the html with base64 before storing it in MySQL
// real escape won't work since it will break the links!

$lastitem = null;
// stays null when the table is still empty

try {
    $conn = new PDO($link, $pdo_username, $pdo_password);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $data222 = $conn->query('SELECT * FROM svvnieuws ORDER BY id DESC LIMIT 1');

    foreach ($data222 as $row) {
        $lastitem = $row['inhoud'];
    }
} catch (PDOException $e) {
    echo 'ERROR: ' . $e->getMessage();
}
// get the last stored item from the db to compare it with the current result!

if ($replacedHrefHtml4 === $lastitem) {
    // if the last item from the db is the same, do not store a new item! important to prevent clutter!
} else {
    // if it's not the same, store a new item!

    $conn = new PDO($link, $pdo_username, $pdo_password);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // set up the connection to the db

    $sql = 'INSERT INTO svvnieuws (inhoud) VALUES (:inhoud)';
    // set the MySQL query string with a named placeholder
    // id is left out on the assumption that it is an auto-increment column

    $rip = $conn->prepare($sql);
    $rip->execute(array(':inhoud' => $replacedHrefHtml4));
    // insert into the db!
}
// close the else!

// place this file outside of the docroot, and let cron run it every, say, 4 hours.
// of course make sure you also place dom.php in the same directory!
// dom.php is my short name for PHP Simple HTML DOM.

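So replacement 1 replaces <a href="whatever"> with <a href="http://www.remotesite.com">.
Replacement 2 replaces the class on that link with fancybox.
Replacement 3 replaces the class on the readon links with fancybox.
Then compare with the last stored item.
Store the new item if it is different.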
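The compare-then-store step described above can be sketched on its own. This is a minimal sketch under stated assumptions, not the code from this answer: `storeIfChanged` is a made-up helper name, the table layout (an auto-increment `id` plus a text column `inhoud`) is assumed, and an in-memory SQLite database stands in for MySQL so that the snippet runs without a server (it needs the pdo_sqlite extension).

```php
<?php
// Hypothetical helper: insert $html only when it differs from the newest row.
// Returns true when a new row was stored, false when nothing changed.
function storeIfChanged(PDO $conn, string $html): bool
{
    // fetchColumn() returns false when the table has no rows yet.
    $last = $conn->query('SELECT inhoud FROM svvnieuws ORDER BY id DESC LIMIT 1')
                 ->fetchColumn();

    if ($last !== false && $last === $html) {
        return false; // same as the last stored item, skip it
    }

    // The value travels as a bound parameter, so quotes inside the HTML
    // are not interpreted as SQL.
    $stmt = $conn->prepare('INSERT INTO svvnieuws (inhoud) VALUES (:inhoud)');
    $stmt->execute([':inhoud' => $html]);
    return true;
}

// Assumption: SQLite in memory as a stand-in for the real MySQL connection.
$conn = new PDO('sqlite::memory:');
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$conn->exec('CREATE TABLE svvnieuws (id INTEGER PRIMARY KEY AUTOINCREMENT, inhoud TEXT)');

$html = '<a class="fancybox fancybox.iframe" href="http://www.remotesite.com">News</a>';
var_dump(storeIfChanged($conn, $html)); // bool(true)
var_dump(storeIfChanged($conn, $html)); // bool(false)
```

Because the HTML is passed as a bound parameter, the base64 step should not be needed just to protect the quotes in the links, although keeping it does no harm.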

I would love to know how to fix the broken links instead of replacing them. The links from that site come out like this: <a href="/index.php?blabla">. If possible, I would like to inject www.remotesite.com into <a href="/index.php?blabla"> to make it <a href="www.remotesite.com/index.php?blabla">
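One way the injection asked about here could be sketched is a single preg_replace call with a lookbehind. This is only a sketch under assumptions: it handles hrefs that are root-relative (they start with exactly one `/`) and double-quoted, the base URL is assumed to contain no `$` or backslash (both are special in a preg_replace replacement string), and `prefixRootRelativeHrefs` and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: put a base URL in front of root-relative href values.
// A regex is not an HTML parser, so this only covers href="/...".
function prefixRootRelativeHrefs(string $html, string $base): string
{
    // Match the single leading slash right after href=" but not the
    // start of "//" (protocol-relative URLs are left untouched).
    $pattern = '~(?<=href=")/(?!/)~i';

    return preg_replace($pattern, rtrim($base, '/') . '/', $html);
}

$html = '<a href="/index.php?blabla">News</a> <a href="http://example.org/x">Other</a>';
echo prefixRootRelativeHrefs($html, 'http://www.remotesite.com');
// <a href="http://www.remotesite.com/index.php?blabla">News</a> <a href="http://example.org/x">Other</a>
```

Links that are already absolute are left alone because their first character after `href="` is not a slash.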

Answer 1: (score: 0)

include_once('db_connect.php');
// connect to my db
require_once('Net/URL2.php');
include_once('dom.php');
// include html_simple_dom!

$dom = file_get_html('http://www.targetsite.com');
// get the html content of a site and pass it through html simple dom !

$elem2 = $dom->find('div[class=blog]', 0);
// set the div to target for !


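$uri = new Net_URL2('http://www.svvenray.nl'); // URI of the resource (should be the URL that was fetched above)
$baseURI = $uri;
// a <base> tag lives in the document head, so look for it in the whole document, not only in the div
foreach ($dom->find('base[href]') as $elem) {
    $baseURI = $uri->resolve($elem->href);
}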

foreach ($elem2->find('*[src]') as $elem) {
$elem->src = $baseURI->resolve($elem->src)->__toString();
}
foreach ($elem2->find('*[href]') as $elem) {
if (strtoupper($elem->tag) === 'BASE') continue;
$elem->href = $baseURI->resolve($elem->href)->__toString();
}

echo $elem2; 

This fixes all the broken links and requires the PHP PEAR package Net_URL2 (Net/URL2.php).
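Where installing the PEAR package is not an option, the resolve step can be approximated with a small function built on parse_url. This is a simplified sketch and not a replacement for Net_URL2: `resolveUrl` is a made-up name, the base is assumed to be an absolute http(s) URL, and `.` or `..` path segments are not normalised.

```php
<?php
// Hypothetical helper: resolve a (possibly relative) reference against a base URL.
// Simplified: no "." / ".." normalisation and no user:password support.
function resolveUrl(string $base, string $ref): string
{
    // Empty, fragment-only, or already absolute (has a scheme): keep as-is.
    if ($ref === '' || $ref[0] === '#' || preg_match('~^[a-z][a-z0-9+.-]*:~i', $ref)) {
        return $ref;
    }

    $p = parse_url($base);
    $origin = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    $path = $p['path'] ?? '/';

    if (strpos($ref, '//') === 0) {
        return $p['scheme'] . ':' . $ref;   // protocol-relative
    }
    if ($ref[0] === '/') {
        return $origin . $ref;              // root-relative
    }
    if ($ref[0] === '?') {
        return $origin . $path . $ref;      // query only
    }
    // Relative to the directory of the base path.
    return $origin . substr($path, 0, strrpos($path, '/') + 1) . $ref;
}

echo resolveUrl('http://www.remotesite.com', '/index.php?blabla'), "\n";
// http://www.remotesite.com/index.php?blabla
echo resolveUrl('http://www.remotesite.com/news/item.html', 'img/a.png'), "\n";
// http://www.remotesite.com/news/img/a.png
```

In the loops above, a call such as `resolveUrl('http://www.targetsite.com', $elem->href)` could then stand in for `$baseURI->resolve($elem->href)->__toString()`, with the limitations listed in the lead-in.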