Regex to get a <script> tag

Time: 2020-11-06 21:00:20

Tags: javascript json regex

I am trying to target the entire script tag whose contents include "@type": "NewsArticle".

Something like:

<script type="application\/ld\+json">[^\{]*?{(.*?)\}[^\}]*?<\/script>

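The regex above lets me target the topmost script tag, but what I am after is the NewsArticle JSON. Here it is the second script tag, but some pages have more than four application/ld+json tags. "@type": "NewsArticle" is always present on every page, though, so I am looking for a regex that targets that specific script.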

Thanks for your help.


<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "Organization",
    "@id": "https://www.givemesport.com/#gms",
    "name": "GiveMeSport",
    "url": "https://www.givemesport.com",
    "logo": {
        "@type": "ImageObject",
        "url": "https://gmsrp.cachefly.net/v4/images/logo-gms-black.png"
    },
    "sameAs":[
        "https://www.facebook.com/GiveMeSport",
        "https://www.instagram.com/givemesport",
        "https://twitter.com/GiveMeSport",
        "https://www.youtube.com/user/GiveMeSport"
    ]
}
</script>
    <script type="application/ld+json">
    {
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "mainEntityOfPage": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "url": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "headline": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "datePublished": "2020-10-30T21:52:48.3510000Z",
    "dateModified": "2020-10-30T21:52:48.3510000Z",
    "description": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "articleSection": "Football",
    "keywords": ["Football","Manchester United","Marcus Rashford","RB Leipzig","Scott McTominay","UEFA Champions"],
    "creator": ["Scott Wilson"],
    "thumbnailUrl": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/144.jpg",
    "author": {
    "@type": "Person",
    "name": "Scott Wilson",
    "sameAs": "https://www.givemesport.com/scott-wilson-1"
    },
    "publisher": {
    "@id": "https://www.givemesport.com/#gms"
    },
    "image": {
    "@type": "ImageObject",
    "url": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/960.jpg",
    "height": 620,
    "width": 960
    }
    }
</script>

1 Answer:

Answer 0: (score: 2)

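Sorry to hear that you don't want to follow best practice; parsing HTML with regular expressions is fraught with problems. If you want a quick fix, though, use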

<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>

See proof
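
As a rough illustration of how the pattern above might be used from JavaScript (one of the question's tags), the sketch below runs it against a page and parses the captured group as JSON. The `html` string is a shortened, hypothetical stand-in for the page in the question, and `extractNewsArticle` is a helper name invented here, not something from the original answer.

```javascript
// The pattern from the answer, written as a JavaScript regex literal.
const newsArticleScript = /<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

// Hypothetical, shortened stand-in for the page shown in the question.
const html = `
<script type="application/ld+json">
{ "@context": "http://schema.org", "@type": "Organization", "name": "GiveMeSport" }
</script>
<script type="application/ld+json">
{ "@context": "http://schema.org", "@type": "NewsArticle", "articleSection": "Football" }
</script>`;

// Hypothetical helper: returns the parsed NewsArticle object, or null when
// no matching script tag is found.
function extractNewsArticle(source) {
  const match = newsArticleScript.exec(source);
  if (match === null) return null;
  // Group 1 holds the text between the opening and closing script tags.
  return JSON.parse(match[1]);
}

const article = extractNewsArticle(html);
console.log(article["@type"], article.articleSection);
```

JSON.parse throws if the captured text is not valid JSON, so real pages may need a try/catch around the call.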

Explanation

--------------------------------------------------------------------------------
  <script                  '<script type="application'
  type="application
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  ld                       'ld'
--------------------------------------------------------------------------------
  \+                       '+'
--------------------------------------------------------------------------------
  json">                   'json">'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        <                        '<'
--------------------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
--------------------------------------------------------------------------------
        script                   'script'
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [\w\W]                   any character of: word characters (a-
                               z, A-Z, 0-9, _), non-word characters
                               (all but a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
    "@type":                 '"@type":'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "NewsArticle"            '"NewsArticle"'
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  script>                  'script>'
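
The effect of the look-ahead described above can be checked with a small comparison. This is only a sketch: `page` is a shortened, hypothetical stand-in for the question's HTML, and `withoutLookahead` is a variant written here for contrast, not a pattern from the original answer.

```javascript
// Hypothetical, shortened stand-in for the page shown in the question.
const page = `
<script type="application/ld+json">
{ "@type": "Organization", "name": "GiveMeSport" }
</script>
<script type="application/ld+json">
{ "@type": "NewsArticle", "articleSection": "Football" }
</script>`;

// The answer's pattern: (?!<\/?script) stops the first part of the capture
// from running past a script tag boundary.
const withLookahead = /<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

// Contrast variant (an assumption for illustration): the same pattern with
// the look-ahead removed.
const withoutLookahead = /<script type="application\/ld\+json">([\w\W]*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

const guarded = withLookahead.exec(page)[1];
const unguarded = withoutLookahead.exec(page)[1];

// With the look-ahead, the capture holds only the NewsArticle block.
console.log(guarded.includes("Organization"));   // false
// Without it, the match starts at the first script tag and runs through
// the Organization block into the second one.
console.log(unguarded.includes("Organization")); // true
```

Without the look-ahead, the lazy `[\w\W]*?` is still free to cross `</script>` when that is the only way to reach "NewsArticle", which is why the capture then begins inside the first tag.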