Question

I have the following code snippet, which basically parses my blog site and stores some of the information in variables:

global $articles;

$items = $html->find('div[class=blogpost]'); 

foreach($items as $post) {
    $articles[] = array($post->children(0)->innertext,
                        $post->children(1)->first_child()->outertext);
}

foreach($articles as $item) {
    echo $item[0]; 
    echo $item[1];
    echo "<br>";
}

The code above outputs the following:

Title of blog post 1 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/cool_news" id="963"  target="_blank" >Click here for news</a> &nbsp;<img src="/news.gif" width="12" height="12" title="validated" /><span class="title">
Title of blog post 2 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/neato" id="963"  target="_blank" >Click here for neato</a> &nbsp;<img src="/news.gif" width="12" height="12" title="validated" /><span class="title">
Title of blog post 3 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/lame" id="963"  target="_blank" >Click here for lame</a> &nbsp;<img src="/news.gif" width="12" height="12" title="validated" /><span class="title">

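Here $item[0] contains "Title of blog post X" and $item[1] contains the rest of the markup.

What I want to do is parse $item[1] and keep only the URL it contains, as a separate variable. Perhaps I am not phrasing my question correctly, but I could not find anything that helps me solve this.

Can anyone help me?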

Answer 1

If you parse $item[1] into whatever DOM crawler object you are using for $html, you can use the following XPath:

$item[1]->find('//a[1]/@href'); // XPath positions are 1-based, so a[1] is the first link

This will return

href="http://www.example.com/cool_news"

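Then extract the URL you want, either with PHP or by refining the XPath query. I am not sure exactly what value the XPath will return; perhaps someone can expand on this.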
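
One possible way to refine the XPath so that it yields the bare URL is sketched below, using PHP's built-in DOMDocument and DOMXPath instead of the crawler from the question. This is only a sketch under assumptions: $itemHtml is a hypothetical stand-in for $item[1], firstLinkUrl() is an illustrative helper name rather than a library function, and PHP 7.1 or later is assumed for the nullable return type.

```php
<?php
// Hypothetical sample standing in for $item[1] from the question.
$itemHtml = '<script type="text/javascript">execute_function(3,\'\')</script>'
    . '<a href="http://www.example.com/cool_news" id="963" target="_blank">Click here for news</a>'
    . ' &nbsp;<img src="/news.gif" width="12" height="12" title="validated" />';

/**
 * Return the href of the first <a> in an HTML fragment, or null if there is none.
 * firstLinkUrl() is an illustrative helper, not part of any library.
 */
function firstLinkUrl(string $fragment): ?string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so the parser always receives a non-empty document.
    $doc->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // (//a)[1] is the first link in document order (XPath positions are 1-based);
    // string() converts the attribute node to its value, without the href="..." wrapper.
    $url = $xpath->evaluate('string((//a)[1]/@href)');

    return $url !== '' ? $url : null;
}

$url = firstLinkUrl($itemHtml);
echo $url, "\n"; // http://www.example.com/cool_news
```

Wrapping the XPath in string() avoids the extra step of stripping href="..." from the result, and the helper returns null when the fragment contains no link.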

Edit: Since you are using Simple DOM Parser, try the following:

$blogItemHtml = new simple_html_dom();
$blogItemHtml->load($item[1]);

$anchors = $blogItemHtml->find('a');
echo $anchors[0]->href; // "http://www.example.com/cool_news"
