Classic ASP RegExp: a small change

Time: 2010-10-26 09:43:40

Tags: html regex asp-classic

I have some regular-expression code that grabs the data between the title tags on a page:

<%
    Function UrlExists(sURL)
        Dim objXMLHTTP
        Dim thePage
        Dim strPTitle   
        Dim blnReturnVal
        Dim objRegExp
        Dim strTitleResponse

        'Create object
        Set objXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP")
        on error resume next

        'Get the head
        objXMLHTTP.Open "HEAD", sURL, false
        objXMLHTTP.setRequestHeader "User-Agent", Request.ServerVariables("HTTP_HOST")
        objXMLHTTP.Send ""

        '404?        
        If Err.Number <> 0 or objXMLHTTP.status <> 200 then blnReturnVal = "0|404 Error" Else blnReturnVal = "1|"
        objXMLHTTP.close

        'If not 404
        if left(blnReturnVal,1) = "1" then

            'Get the physical page
            objXMLHTTP.Open "GET", sURL, false
            objXMLHTTP.Send ""
                thePage = objXMLHTTP.responseText 
                thePage = replace(thePage, vbCrlf, "")
            objXMLHTTP.close

            'Find title
            Set objRegExp = New Regexp

            objRegExp.IgnoreCase = true
            objregexp.Multiline = true
            objRegExp.Global = false
            objRegExp.Pattern = "<title[^>]*?>(.*)</title>" 

            set strPTitle =  objRegExp.Execute(thePage)
            strTitleResponse = strPTitle.Item(0).Value
            strTitleResponse = replace(strTitleResponse, vbCrlf, "")
            strTitleResponse = trim(strTitleResponse)
            if len(strTitleResponse) <1 OR strTitleResponse = "" then strTitleResponse = "(No Title)"

            set objRegExp = nothing
            strTitleResponse = replace(strTitleResponse,"</title>","")
            strTitleResponse = replace(strTitleResponse,"<title>","")
            strTitleResponse = replace(strTitleResponse,"'","&#39; ")
            blnReturnVal = blnReturnVal & strTitleResponse

        end if

        Set objXMLHTTP = nothing

        UrlExists = blnReturnVal
    End Function
%>        

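This has worked well for several months, but when I wrote it I (foolishly?) assumed that every page would contain either one title tag or none. It recently started throwing odd errors on the John Lewis page, because its HTML contains two title elements: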

    <title>John Lewis - Shop online at Britain's Favourite Retailer</title>
... bunch of html
<title>

    </title>

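How can I modify the regular expression so that it matches only the first pair of tags, instead of being confused by HTML like the above?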

1 answer:

Answer 0 (score: 1)

Before all the "you should use a parser" comments arrive: make your regular expression non-greedy:

objRegExp.Pattern = "<title[^>]*?>(.*?)</title>" 

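Note the ? added after the .* inside the capture group. By default, .* is greedy and matches as much as possible, so with two title elements on the page it runs on to the last </title>. Appending ? reverses that behavior: .*? matches as little as possible and stops at the first </title>.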
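
The difference can be seen side by side with a small test page. This is only a sketch: the sample HTML string and the variable names (sampleHtml, re, matches) are made up for illustration, and it assumes a scripting engine recent enough (VBScript 5.5 or later) to support lazy quantifiers and the SubMatches collection.

```vbscript
<%
' Hypothetical sample modelled on the question: two <title> elements,
' already on one line (the question's code strips vbCrLf before matching).
Dim sampleHtml
sampleHtml = "<title>First title</title><p>body</p><title> </title>"

Dim re, matches
Set re = New RegExp
re.IgnoreCase = True
re.Global = False

' Greedy: (.*) runs on to the LAST </title> in the string.
re.Pattern = "<title[^>]*?>(.*)</title>"
Set matches = re.Execute(sampleHtml)
Response.Write Server.HTMLEncode(matches(0).SubMatches(0)) & "<br>"

' Lazy: (.*?) stops at the FIRST </title>.
re.Pattern = "<title[^>]*?>(.*?)</title>"
Set matches = re.Execute(sampleHtml)
Response.Write Server.HTMLEncode(matches(0).SubMatches(0)) & "<br>"

Set re = Nothing
%>
```

With the greedy pattern the captured text spans from the first title through the opening tag of the second one; with the lazy pattern it is just the text of the first title. SubMatches(0) returns only the captured group, so the surrounding tags do not need to be stripped afterwards.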

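Caveat: I know nothing about regular expressions in classic ASP (or "modern" ASP, if there is such a thing), but since the non-greedy/lazy operator is already used in the part of the pattern that matches the <title> tag itself, I assume it will work.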
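
For what it is worth, one way the lazy pattern might be wired into the title lookup is sketched below. FirstTitle is a hypothetical helper, not part of the original code. It assumes that "." does not match a line feed in this engine, which is why every line-break style is flattened first, and it checks Matches.Count so that a page with no title falls back to "(No Title)" instead of raising an error.

```vbscript
<%
' Hypothetical helper (not in the original code): returns the text of the
' first <title> element in sHtml, or "(No Title)" when none is found.
Function FirstTitle(sHtml)
    Dim re, matches, sFlat

    ' Flatten CRLF, CR and LF so that "." can cross former line boundaries.
    sFlat = Replace(sHtml, vbCrLf, " ")
    sFlat = Replace(sFlat, vbCr, " ")
    sFlat = Replace(sFlat, vbLf, " ")

    Set re = New RegExp
    re.IgnoreCase = True
    re.Global = False
    re.Pattern = "<title[^>]*?>(.*?)</title>"

    Set matches = re.Execute(sFlat)

    ' Default result covers both "no match" and "empty title".
    FirstTitle = "(No Title)"
    If matches.Count > 0 Then
        If Len(Trim(matches(0).SubMatches(0))) > 0 Then
            FirstTitle = Trim(matches(0).SubMatches(0))
        End If
    End If

    Set re = Nothing
End Function
%>
```

Inside UrlExists, the block under 'Find title could then be reduced to something like strTitleResponse = FirstTitle(thePage), followed by the existing apostrophe replacement.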