I am trying to match the entire script tag whose contents include "@type": "NewsArticle".
Something like:
<script type="application\/ld\+json">[^\{]*?{(.*?)\}[^\}]*?<\/script>
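The regex above matches the first script tag. What I am after, though, is the NewsArticle JSON, which happens to be the second tag in this case. Some pages contain four or more application/ld+json tags, but "@type": "NewsArticle" is present on every page. So I am looking for a pattern that targets that specific script.
Thanks for your help.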
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Organization",
"@id": "https://www.givemesport.com/#gms",
"name": "GiveMeSport",
"url": "https://www.givemesport.com",
"logo": {
"@type": "ImageObject",
"url": "https://gmsrp.cachefly.net/v4/images/logo-gms-black.png"
},
"sameAs":[
"https://www.facebook.com/GiveMeSport",
"https://www.instagram.com/givemesport",
"https://twitter.com/GiveMeSport",
"https://www.youtube.com/user/GiveMeSport"
]
}
</script>
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "NewsArticle",
"mainEntityOfPage": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
"url": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
"headline": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
"datePublished": "2020-10-30T21:52:48.3510000Z",
"dateModified": "2020-10-30T21:52:48.3510000Z",
"description": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
"articleSection": "Football",
"keywords": ["Football","Manchester United","Marcus Rashford","RB Leipzig","Scott McTominay","UEFA Champions"],
"creator": ["Scott Wilson"],
"thumbnailUrl": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/144.jpg",
"author": {
"@type": "Person",
"name": "Scott Wilson",
"sameAs": "https://www.givemesport.com/scott-wilson-1"
},
"publisher": {
"@id": "https://www.givemesport.com/#gms"
},
"image": {
"@type": "ImageObject",
"url": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/960.jpg",
"height": 620,
"width": 960
}
}
</script>
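Answer 0 (score: 2)
Sorry to hear that you don't want to follow best practice; parsing HTML with regex is fraught with problems. If you want a quick fix, though, use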
<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>
See proof.
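As a minimal sketch of how the pattern could be applied, the snippet below runs it with Python's `re` module and parses capture group 1 as JSON. The choice of Python, the helper name `find_news_article`, and the trimmed `sample_html` are assumptions made for illustration; the question does not say which language or regex engine is in use.

```python
import json
import re

# The pattern from the answer, split over three lines for readability.
PATTERN = re.compile(
    r'<script type="application\/ld\+json">'
    r'((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)'
    r'<\/script>'
)


def find_news_article(html):
    """Return the parsed NewsArticle JSON-LD object, or None if absent.

    Hypothetical helper: it assumes the captured text is valid JSON.
    """
    match = PATTERN.search(html)
    if match is None:
        return None
    # Group 1 holds everything between the opening and closing script tags.
    return json.loads(match.group(1))


# Trimmed-down version of the page in the question (illustrative only).
sample_html = """
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Organization",
"name": "GiveMeSport"
}
</script>
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "NewsArticle",
"headline": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
"author": {
"@type": "Person",
"name": "Scott Wilson"
}
}
</script>
"""

article = find_news_article(sample_html)
```

Because the tempered token `(?:(?!<\/?script)[\w\W])*?` cannot cross a `<script` or `</script` boundary, the attempt that starts at the Organization tag fails and the search moves on to the NewsArticle tag.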
Explanation
--------------------------------------------------------------------------------
  <script                  '<script type="application'
  type="application
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  ld                       'ld'
--------------------------------------------------------------------------------
  \+                       '+'
--------------------------------------------------------------------------------
  json">                   'json">'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        <                        '<'
--------------------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
--------------------------------------------------------------------------------
        script                   'script'
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [\w\W]                   any character of: word characters
                               (a-z, A-Z, 0-9, _), non-word characters
                               (all but a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
    "@type":                 '"@type":'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "NewsArticle"            '"NewsArticle"'
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  script>                  'script>'
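
For comparison, here is a rough sketch of the parser-based route that the opening remark alludes to, using only Python's standard library (`html.parser` and `json`). The class `JsonLdCollector`, the function `find_by_type`, and the shortened `page` string are hypothetical names invented for this example, and Python itself is an assumption, since the question does not name a language.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append("".join(self._buffer))


def find_by_type(html, wanted_type):
    """Return the first top-level JSON-LD object whose @type matches."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and data.get("@type") == wanted_type:
            return data
    return None


# Shortened stand-in for the page in the question (illustrative only).
page = """
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Organization", "name": "GiveMeSport"}
</script>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "NewsArticle", "articleSection": "Football",
 "author": {"@type": "Person", "name": "Scott Wilson"}}
</script>
"""

news = find_by_type(page, "NewsArticle")
```

This version compares the top-level `@type` after parsing, so a nested `"@type": "NewsArticle"` inside some other object would not trigger a match, and the number or order of ld+json tags on the page does not matter.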