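fff.html is an email that contains email addresses. Some of them are wrapped in href mailto links and some are not. I want to scrape them and output them in the following format: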
Lorem@ipsum.com,dolor@sit.com,amet@consectetur.com
I have a simple scraper that grabs the ones with href links, but something odd is going on:
<?php
$url = "fff.html";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<a href="mailto:');
$end = strpos($content,'"',$start) + 8;
$mail = substr($content,$start,$end-$start);
print "$mail<br />";
?>
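I should get extra points for the original use of lorem ipsum.
Answer 0 (score: 3)
The problem arises when the HTML page contains more than one email address: strpos only finds the first occurrence, so substr returns only the first instance. Here is a script that parses all of the addresses that appear in mailto links. You may need to tweak it for your use. It outputs the results in the comma-separated format you requested.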
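<?php
$url = "fff.html";
$raw = file_get_contents($url);

// Strip tabs, line breaks, double spaces, NUL bytes and vertical tabs.
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Restrict the search to the document body when both tags are present;
// otherwise fall back to the whole document.
$start = strpos($content, '<body');
$end = strpos($content, '</body>');
$data = ($start !== false && $end !== false)
    ? substr($content, $start, $end - $start)
    : $content;

// Capture the address inside every href="mailto:..." attribute.
$pattern = '#<a[^>]+href="mailto:([^"]+)"[^>]*?>#is';
preg_match_all($pattern, $data, $matches);

// $matches[1] holds the captured addresses (an empty array if none matched).
$emails = $matches[1];

// Join with a bare comma to match the requested output format.
echo implode(',', $emails);
?>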
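The question also mentions addresses that are not wrapped in a mailto link, which the pattern above does not pick up. One possible way to catch those as well is sketched below. It rests on a few assumptions: the function name `extract_emails` is hypothetical, the regular expression is a deliberately simple approximation of an email address rather than a full RFC 5322 matcher, and the sample HTML is made up for illustration.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns every distinct address found in the markup, linked or not.
function extract_emails(string $html): array
{
    // Decode entities first so that e.g. "&#64;" becomes "@".
    $decoded = html_entity_decode($html);

    // Simplified address pattern (an assumption, not a complete validator).
    // ":" and "?" are not in the character classes, so "mailto:" prefixes
    // and "?subject=..." suffixes are left out of the match.
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
    preg_match_all($pattern, $decoded, $matches);

    // A linked address usually appears twice (href and link text),
    // so drop duplicates and reindex the array.
    return array_values(array_unique($matches[0]));
}

// Made-up sample input for illustration only.
$sample = '<body><a href="mailto:Lorem@ipsum.com">Lorem@ipsum.com</a>'
        . '<p>Write to dolor@sit.com.</p>'
        . '<a href="mailto:amet@consectetur.com?subject=Hi">Amet</a></body>';

echo implode(',', extract_emails($sample));
?>
```

To run it against the real file, pass `file_get_contents("fff.html")` to the function instead of the sample string. Note that the duplicate check is case-sensitive, so `Lorem@ipsum.com` and `lorem@ipsum.com` would be kept as two separate entries.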