How can I ignore the http link in a string and return everything else?

Time: 2013-08-25 23:15:47

Tags: php regex html-parsing

I'm trying to parse some HTML content. Here is the HTML:

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

To parse this and get the event name, event time, and stream number, I do this:

preg_match_all('/<\/font>\s*([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2}).*?tream\s*(.*?)\s*<\/font><p>/', $data, $matches);

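It returns everything correctly, except that for the streams with an http link it returns part of the link, which I don't want. I only want the name (for the streams that have one) and otherwise just the number.

Desired data: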

5
4
CHANNEL TWO 2 STREAM
16
2
CHANNEL THREE 3 STREAM

Currently it returns:

5
4
-online.html
16
2
-online.html

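Can anyone help? I'm not a pro at regular expressions and have been trying for the past 2 days. Thanks in advance!

3 Answers:

Answer 0: (score: 1)

However, if you want to do it with a regex, then based on your data you need this: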

preg_match_all('/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/', $data, $matches);

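This puts the name in $matches[1], the time in $matches[2], and the channel in $matches[3].


Explanation of the regex:

  1. (?:<\/font> ) finds (without capturing) the first closing font tag on each line, including the space after it
  2. ((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*) grabs runs of non-digits, plus any one- or two-digit number that is not followed by a dot or colon (checked with a negative lookahead), repeated as needed and captured as one group
  3. ([^<]+) grabs everything up to the next "<", except the trailing space
  4. <[^>]+> skips over the tag, up to and including the next ">"
  5. (?:Stream )? skips the word "Stream" if it comes first
  6. ([^h<]+) grabs everything up to a lowercase "h" or a "<"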
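
To see the pattern at work, here is a minimal sketch that runs it over three of the sample lines from the question. The variable names ($data, $pattern, $names, $times, $channels) are illustrative only, and the trim() calls are an addition: groups 1 and 3 keep a trailing space as written.

```php
<?php
// Three of the sample lines from the question.
$data = <<<'HTML'
<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
HTML;

// The pattern from this answer, unchanged.
$pattern = '/(?:<\/font> )((?:[^0-9]+(?:[0-9](?!\.|:|[0-9]))?(?:[0-9]{2}(?!\.|:))?)*)([^<]+) <[^>]+>(?:Stream )?([^h<]+)/';
preg_match_all($pattern, $data, $matches);

// Groups 1 and 3 end with a space, so trim them (not part of the original answer).
$names    = array_map('trim', $matches[1]);
$times    = $matches[2];
$channels = array_map('trim', $matches[3]);

print_r($channels);
```

Because ([^h<]+) stops at any lowercase "h", a channel name that itself contains one would be cut short, so this approach fits data shaped like the sample rather than arbitrary input.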

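Answer 1: (score: 0)

Description

This expression will:

  • find all font tags whose color attribute is gold
  • skip the word Stream if it is the first word
  • capture the interesting text
  • stop capturing when it reaches the http:// link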

<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*
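
The two key pieces of the expression, \K and the tempered negative lookahead, can be tried in isolation. The sketch below is a simplified version, not the full expression: it assumes the tag is always written exactly as <font color="gold">, and the names $data, $pattern, and $streams are illustrative.

```php
<?php
$data = <<<'HTML'
<font color="gold">Stream 5</font><p>
<font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
HTML;

// \K discards the tag and the optional leading "Stream " from the reported match.
// (?:(?!\s*https?:|<\/font>).)* consumes one character at a time and stops
// just before whitespace followed by "http:"/"https:", or before "</font>".
$pattern = '/<font color="gold">(?:Stream\s*)?\K(?:(?!\s*https?:|<\/font>).)*/';
preg_match_all($pattern, $data, $matches);

// With \K, the text of interest is the whole match, so it sits in $matches[0].
$streams = $matches[0];
print_r($streams);
```

The full expression replaces the literal <font color="gold"> with attribute-aware matching so that other attributes and quoting styles are tolerated.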


Example


Sample text

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

Matches

[0] => 5
[1] => 4
[2] => CHANNEL TWO 2 STREAM
[3] => 16
[4] => 2
[5] => CHANNEL THREE 3 STREAM

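Answer 2: (score: 0)

Description

This expression will:

  • capture the title
  • capture the event name
  • capture the event time
  • find all font tags that have color = gold
  • skip the word Stream, if present
  • capture the interesting text
  • stop capturing when it reaches the http:// link
  • trim the pesky whitespace around the match
  • allow the font tag's attributes to appear in any order inside the tag, which avoids some very difficult edge cases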

<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?green['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scolor=['"]?gold['"]?)(?:[^>=|&)]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)

Example


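Sample text

Group 0 gets the entire match
Group 1 gets the title
Group 2 gets the event name
Group 3 gets the event time
Group 4 gets the stream number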

<font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5</font><p>
<font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4</font><p>
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
<font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2</font><p>
<font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM http://domain.com/path/to/page-3-online.html</font><p>

PHP code example

<?php
$sourcestring="your source string";
// In a single-quoted PHP string the regex \\ has to be written as \\\\, otherwise the quote after it ends the string.
// PREG_SET_ORDER groups the results per match, which is the [match][group] layout listed under Matches.
preg_match_all('/<font(?=\s|>)(?=(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\scolor=[\'"]?green[\'"]?)(?:[^>=|&)]|=\'(?:[^\']|\\\\\')*\'|="(?:[^"]|\\\\")*"|=[^\'"][^\s>]*)*>\s*(?:Stream\s*)?((?:(?!<\/font>).)*)<\/font>\s*[^<]*?([^<]+)\s+(\d+.\d+\s*\w{2}\s*-\s*\d+.\d+\s*\w{2})[^<]*?<font(?=\s|>)(?=(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?\scolor=[\'"]?gold[\'"]?)(?:[^>=|&)]|=\'(?:[^\']|\\\\\')*\'|="(?:[^"]|\\\\")*"|=[^\'"][^\s>]*)*>(?:Stream\s*)?((?:(?!\s*https?:|<\/font>).)*)
/imsx',$sourcestring,$matches,PREG_SET_ORDER);
echo "<pre>".print_r($matches,true);
?>

Matches

[0][0] = <font color="green"> *TITLE* </font> Some Event Name 1:15pm-5:00pm <font color="gold">Stream 5
[0][1] = *TITLE* 
[0][2] = Some Event Name
[0][3] = 1:15pm-5:00pm
[0][4] = 5

[1][0] = <font color="green"> *TITLE* </font> Some: Event Name 1:30pm-5:00pm <font color="gold">Stream 4
[1][1] = *TITLE* 
[1][2] = Some: Event Name
[1][3] = 1:30pm-5:00pm
[1][4] = 4

[2][0] = <font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM
[2][1] = *TITLE* 
[2][2] = Some, Event Name 1 with num
[2][3] = 1:30pm-7:30pm
[2][4] = CHANNEL TWO 2 STREAM

[3][0] = <font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16
[3][1] = *TITLE* 
[3][2] = Event two
[3][3] = 2.45pm-4.45pm
[3][4] = 16

[4][0] = <font color="green"> *TITLE* </font> Event THREE summary 2.45pm-4.45pm <font color="gold">Stream 2
[4][1] = *TITLE* 
[4][2] = Event THREE summary
[4][3] = 2.45pm-4.45pm
[4][4] = 2

[5][0] = <font color="green"> *TITLE* </font> Event with a lot of summary 4:00pm-6:00pm <font color="gold">CHANNEL THREE 3 STREAM
[5][1] = *TITLE* 
[5][2] = Event with a lot of summary
[5][3] = 4:00pm-6:00pm
[5][4] = CHANNEL THREE 3 STREAM
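
When the markup is as regular as the sample, the same four groups can be captured with a much shorter pattern. The following is only a sketch under that assumption: it expects the tags to be written exactly as <font color="green"> and <font color="gold">, so it gives up the attribute-order tolerance of the full expression. The named groups (title, name, time, stream) and the variables $data, $pattern, and $rows are illustrative.

```php
<?php
$data = <<<'HTML'
<font color="green"> *TITLE* </font> Some, Event Name 1 with num 1:30pm-7:30pm <font color="gold">CHANNEL TWO 2 STREAM http://http://domain.com/path/to/page-2-online.html</font><p>
<font color="green"> *TITLE* </font> Event two 2.45pm-4.45pm <font color="gold">Stream 16</font><p>
HTML;

// Assumes the exact attribute spelling used in the sample text.
$pattern = '~<font color="green">\s*(?<title>.*?)\s*</font>\s*'               // title, trimmed
         . '(?<name>[^<]+?)\s+'                                               // event name (lazy)
         . '(?<time>\d+[.:]\d+\s*\w{2}\s*-\s*\d+[.:]\d+\s*\w{2})\s*'          // e.g. 1:30pm-7:30pm
         . '<font color="gold">(?:Stream\s*)?'                                // skip leading "Stream"
         . '(?<stream>(?:(?!\s*https?:|</font>).)*)~';                        // stop before the link

// PREG_SET_ORDER yields one array per matched line.
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    $rows[] = [
        'title'  => $m['title'],
        'name'   => $m['name'],
        'time'   => $m['time'],
        'stream' => $m['stream'],
    ];
}
print_r($rows);
```

Unlike the full expression, this version matches case-sensitively and trims the space after the title, so group contents can differ slightly from the listing above.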