How do I get the content inside the alt attribute?

Time: 2016-06-06 21:15:57

Tags: php regex

How can I use a regex to get the content inside the alt attribute?

Given this text:

<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg " alt=" I want to get this text " width=" 400 " height="300" /></a>

How can I match I want to get this text?

I have tried alt=".*", but that yields alt=" I want to get this text " width=" 400 " height="300", which is not what I want.

2 Answers:

Answer 0: (score: 1)

Foreword

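You really should use an HTML parser. That said, you seem to have creative control over the source string, and if it really is this simple, there should be fewer edge cases to worry about.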
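For comparison, here is a minimal sketch of the parser route using PHP's bundled DOM extension. It assumes the extension is enabled; the variable names are illustrative, and the input is just the img tag from the question.

```php
<?php
// The img tag from the question, unchanged.
$html = '<img class="aligncenter" src="myblog.com/wp-content/image.jpg " alt=" I want to get this text " width=" 400 " height="300" />';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them,
// since real-world markup is rarely strictly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$alts = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('alt')) {
        // trim() drops the padding spaces around the attribute value.
        $alts[] = trim($img->getAttribute('alt'));
    }
}

print_r($alts);
```

The parser handles quoting and attribute order on its own, so none of the regex edge cases discussed in this answer apply to it.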

Description

<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>

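This regular expression will do the following:

  • Find all img tags
  • Require the img tag to have an alt attribute
  • Capture the alt attribute value into capture group 1
  • Allow attribute values to be wrapped in single quotes, double quotes, or no quotes (the alt value itself must be quoted, and the capture ends at the next double quote)
  • Avoid some very difficult edge cases that make matching HTML hard

Example

Live Demo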

https://regex101.com/r/cN0lD4/2

Sample text

Note the difficult edge case in the second img tag.

<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" /></a>

<img onmouseover='  alt="This is not the droid you are looking for" ;'  class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />

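Sample Matches

  • Capture group 0 gets the entire img tag
  • Capture group 1 gets only the value in the alt attribute, not including any surrounding quotes
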
[0][0] = <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" />
[0][1] =  I want to get this text

[1][0] = <img onmouseover='  alt="This is not the droid you are looking for" ;'  class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
[1][1] = This is the droid I'm looking for.
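The matches listed above can be reproduced in PHP along these lines. This is a sketch rather than the only way to do it: it assumes the expression is written with no whitespace between the lookahead and the group that follows it, and it uses ~ as the delimiter and nowdoc strings purely to avoid escaping quotes. The variable names are illustrative.

```php
<?php
// The expression from this answer. A nowdoc keeps both quote characters literal.
$pattern = <<<'REGEX'
~<img\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\salt=['"]([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>~
REGEX;

// The sample text, including the misleading alt inside onmouseover.
$html = <<<'HTML'
<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg" alt=" I want to get this text" width =" 400 " height="300" /></a>

<img onmouseover='  alt="This is not the droid you are looking for" ;'  class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
HTML;

// PREG_SET_ORDER groups results per match: $matches[i][0] is the whole tag,
// $matches[i][1] is the captured alt value.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $i => $m) {
    echo "[$i][1] = ", $m[1], PHP_EOL;
}
```

Call trim() on group 1 if the padding spaces inside the quotes are unwanted.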

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <img                     '<img'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    alt=                     'alt='
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"\s]*                 any character except: ''', '"',
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
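The node-by-node breakdown above can also live next to the pattern itself by using PCRE's extended (x) modifier, which ignores whitespace and #-comments outside character classes. The following is a sketch of that idea. The inline comments paraphrase the table and are not the author's wording, and the test tag is the second one from the sample text.

```php
<?php
// Same expression as in the answer, spread over several lines. With the x
// modifier, whitespace and #-comments outside character classes are ignored.
$pattern = <<<'REGEX'
~
<img\s
(?=                          # look ahead: the tag must carry an alt attribute
  (?: [^>=]                  #   a character outside any attribute value
    | ='[^']*'               #   a single-quoted value
    | ="[^"]*"               #   a double-quoted value
    | =[^'"][^\s>]*          #   an unquoted value
  )*?
  \salt=['"] ([^"]*) ['"]?   #   group 1: the alt value without its quotes
)
(?: [^>=] | ='[^']*' | ="[^"]*" | =[^'"\s]* )*   # consume the attributes
"\s?\/?>                     # last closing quote, optional slash, end of tag
~x
REGEX;

$tag = <<<'HTML'
<img onmouseover='  alt="This is not the droid you are looking for" ;'  class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
HTML;

if (preg_match($pattern, $tag, $m)) {
    echo $m[1], PHP_EOL;
}
```

Because the single-quoted onmouseover value is consumed as one unit, the alt text inside it is never mistaken for the real attribute.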

Answer 1: (score: 0)

Thanks to those who helped solve this:

'/<img.*?alt="(.*?)".*>/'
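To illustrate why the lazy quantifiers matter, here is a small sketch comparing the greedy attempt from the question with the pattern above. The variable names are illustrative, and the tricky input is borrowed from the other answer's sample text to show where this simpler pattern can be fooled.

```php
<?php
// Input from the question.
$html = '<a href="gallery.com/gallery-name"; target="_blank"> <img class="aligncenter" src="myblog.com/wp-content/image.jpg " alt=" I want to get this text " width=" 400 " height="300" /></a>';

// Greedy attempt from the question: .* runs on to the last double quote.
preg_match('/alt=".*"/', $html, $greedy);

// Lazy pattern from this answer: (.*?) stops at the first closing quote.
preg_match('/<img.*?alt="(.*?)".*>/', $html, $lazy);

echo $greedy[0], PHP_EOL;      // the over-long match described in the question
echo trim($lazy[1]), PHP_EOL;  // only the alt text, with padding removed

// Tricky input: an alt="..." buried inside another attribute's value.
$tricky = <<<'HTML'
<img onmouseover='  alt="This is not the droid you are looking for" ;'  class="aligncenter" src="myblog.com/wp-content/image.jpg" alt="This is the droid I'm looking for." width =" 400 " height="300" />
HTML;

// The lazy scan stops at the first alt=" it sees, which is the wrong one here.
preg_match('/<img.*?alt="(.*?)".*>/', $tricky, $fooled);
echo $fooled[1], PHP_EOL;
```

For markup you control, the short pattern is usually enough. For arbitrary HTML, the longer expression or a parser is the safer choice.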