Question

I'm currently scraping a website and trying to strip out a piece of the content that I don't want included in my array.

So my current code is:

$content['article'] = $html2->find('.hentry-content',0);
$content['article'] = $content['article']->plaintext;

This returns everything inside the .hentry-content class on the site I'm collecting content from.

The returned content currently looks like this:

array (
[article] => This is some example filler content please no actual meaning behind random bridge for bridge random you dog tomorrow http://example.com/our-random-mp3.com
)

The output usually ends with a link to a random MP3. Is there any way I can pull just the content part into the array, without the MP3?

Answer 1

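If the link is inside an <a> tag, this does the job. Run it on the element returned by find(), before it is overwritten with ->plaintext: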

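// $content['article'] must still be the element object here, not its plaintext string
foreach($content['article']->find('a') as $item) {
    // replace the <a> element's markup with an empty string
    $item->outertext = '';
}

echo $content['article']->plaintext;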
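
For comparison, here is a minimal sketch of the same idea (remove the <a> elements, then read the text) using PHP's built-in DOM extension instead of Simple HTML DOM. This swaps in a different library, so it is not the code from the answer above; the sample markup and variable names are made up for illustration, and only the .hentry-content class comes from the question.

```php
<?php
// Made-up sample markup; only the class name comes from the question.
$html = '<div class="hentry-content"><p>Some filler text.</p>'
      . '<a href="http://example.com/track.mp3">http://example.com/track.mp3</a></div>';

$doc = new DOMDocument();
// Suppress parser warnings, which real-world markup often triggers.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// First element whose class attribute contains the token "hentry-content".
$article = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " hentry-content ")]'
)->item(0);

// Copy the matches into an array so iteration is unaffected by the removals.
foreach (iterator_to_array($xpath->query('.//a', $article)) as $link) {
    $link->parentNode->removeChild($link);
}

// Text of the article with every <a> element gone.
$text = trim($article->textContent);
echo $text, PHP_EOL; // prints "Some filler text."
```

This only helps when the MP3 URL is wrapped in an <a> element; a bare URL in the text is left untouched.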

Answer 2

If the returned text contains just one link to a random mp3 file, you can filter it out with the following:

$url_pattern = '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/';
$content['article'] = preg_replace($url_pattern, '', $content['article']->plaintext);

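This is meant to remove the URL from the text. I took the URL pattern from http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149. Note that the pattern is anchored with ^ and $, so as written it only matches when the entire string is a URL. To strip a URL that follows other text, the leading ^ at least would have to be dropped.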
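
As a rough sketch of stripping a URL that sits at the end of the scraped text, the following uses a simpler pattern than the one quoted above. The function name strip_trailing_url and the pattern are assumptions for illustration, not part of the original answer, and it only handles URLs that start with http:// or https://.

```php
<?php
// Hypothetical helper (not from the original answers): drops a URL that
// sits at the very end of a plain-text string.
function strip_trailing_url(string $text): string
{
    // No leading ^ anchor, so the URL may follow arbitrary text;
    // the trailing $ restricts the match to the end of the string.
    $pattern = '~\s*https?://\S+\s*$~i';
    return preg_replace($pattern, '', $text);
}

$article = 'This is some example filler content http://example.com/our-random-mp3.com';
echo strip_trailing_url($article), PHP_EOL;
// prints "This is some example filler content"
```

With Simple HTML DOM, this would be applied to the string from ->plaintext. Text that has no trailing URL, or only a URL in the middle, comes back unchanged.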

Remove part of a scraped array
