Exclude the beginning from a regex match

Asked: 2015-01-29 18:26:08

Tags: regex

I need a regular expression that, given this input:

<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" title="ააგებს">ააგებს</a></li>

will match only:

%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1

So far I have:

<li><a href="/wiki/%.*\d

But I don't know how to exclude the beginning from the result. Any ideas? I am using Python.

3 answers:

Answer 0 (score: 1)

I'm not sure which regex flavor you are using, so this is a best guess:

/href="\/wiki\/((?:%[a-f0-9]{2})+)"/ig
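Since the question mentions Python, the pattern above might be applied with the standard `re` module roughly as follows. This is only a sketch: the variable names are illustrative, and the `/.../ig` delimiters and flags are replaced by `re.IGNORECASE` and `search`. The leading `href="/wiki/` is still matched, but only the capture group is read, so it is left out of the result.

```python
import re

# Sample input from the question.
html = ('<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" '
        'title="ააგებს">ააგებს</a></li>')

# Group 1 captures only the percent-encoded run;
# the href="/wiki/ prefix is matched but not captured.
pattern = re.compile(r'href="/wiki/((?:%[a-f0-9]{2})+)"', re.IGNORECASE)

match = pattern.search(html)
encoded = match.group(1) if match else None
print(encoded)
```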

Answer 1 (score: 1)

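If you are using a .NET language, you can do something more robust than trying to pull the value out with a regular expression alone. HtmlAgilityPack is well suited to parsing HTML, even when the HTML is somewhat malformed.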

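Here I have a function that tries to extract the href attribute of the first <a> element in a piece of HTML; the rest of the program then shows two ways to extract the part of the href that follows "/wiki/":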

Option Infer On

Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module Module1

    ''' <summary>
    ''' Get the value of the href attribute in the first anchor (&lt;a>) element of (a fragment of) an HTML string.
    ''' </summary>
    ''' <param name="s">An HTML fragment.</param>
    ''' <returns>The value of the href attribute in the first anchor (&lt;a>) element.</returns>
    ''' <remarks>Throws a FormatException if the href value cannot be found.</remarks>
    Function GetHref(s As String) As String
        ' Get the value of the href attribute, if it exists, in a reliable fashion. '
        Dim htmlDoc As New HtmlDocument
        htmlDoc.LoadHtml(s)
        Dim link = htmlDoc.DocumentNode.SelectSingleNode("//a")
        Dim hrefValue = String.Empty

        If link IsNot Nothing Then
            If link.Attributes("href") IsNot Nothing Then
                hrefValue = link.Attributes("href").Value
            Else
                ' there was no href '
                Throw New FormatException("No href attribute in the <a> element.")
            End If
        Else
            ' there was no <a> element '
            Throw New FormatException("No <a> element.")
        End If

        Return hrefValue

    End Function

    Sub Main()
        Dim s = "<li><a href=""/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1"" title=""ააგებს"">ააგებს</a></li>"

        Dim dataToCapture = String.Empty

        Dim hrefValue = GetHref(s)

        ' OPTION 1 - using RegEx
        ' Only get a specific pattern of characters
        Dim re = New Regex("^/wiki/((?:%[0-9A-F]{2})+)", RegexOptions.IgnoreCase)
        Dim m = re.Match(hrefValue)

        If m.Success Then
            dataToCapture = m.Groups(1).Value
            Console.WriteLine(dataToCapture)
        Else
            Console.WriteLine("Failed to match with RegEx.")
        End If

        ' OPTION 2 - looking at the string
        ' Just get whatever comes after the required start of the href value.
        Dim mustStartWith = "/wiki/"
        If hrefValue.StartsWith(mustStartWith) Then
            dataToCapture = hrefValue.Substring(mustStartWith.Length)
            Console.WriteLine(dataToCapture)
        Else
            Console.WriteLine("Nothing found with string operations.")
        End If

        ' the percent-encoded data could be decoded with System.Uri.UnescapeDataString(dataToCapture) '

        Console.ReadLine()

    End Sub

End Module

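In a regular expression, parentheses, i.e. ( ), denote a group to be captured. However, we don't need to capture the individual %AA parts, so that inner group has the ?: modifier, which marks it as a non-capturing group.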

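(The spurious trailing ' characters in some of the comments are only there to help the code get highlighted correctly.)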
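The difference between capturing and non-capturing groups can be seen in a small Python sketch. The shortened `href` value is made up for illustration, and the second pattern is a deliberately modified variant that is not part of the code above.

```python
import re

# Shortened, illustrative href value (not the full one from the question).
href = "/wiki/%E1%83%90%E1%83%90"

# Inner group is non-capturing: the only group is the whole percent-encoded run.
non_capturing = re.match(r"^/wiki/((?:%[0-9A-F]{2})+)", href, re.IGNORECASE)

# Inner group is capturing: a second group appears, holding only the last repetition.
capturing = re.match(r"^/wiki/((%[0-9A-F]{2})+)", href, re.IGNORECASE)

print(non_capturing.groups())  # ('%E1%83%90%E1%83%90',)
print(capturing.groups())      # ('%E1%83%90%E1%83%90', '%90')
```

With the non-capturing form, group 1 is the only group to deal with, which keeps the result handling simple.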

Answer 2 (score: 0)

Since you are using Python, you could try the pattern out in something like the Python Regular Expression Testing Tool:

>>> import re
>>> regex = re.compile("href=\"/wiki/((?:%[0-9A-F]{2})+)\"",re.IGNORECASE)
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xd640db26af2f1d60>
>>> regex.match(string)
None

# List the groups found
>>> r.groups()
(u'%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1',)

# List the named dictionary objects found
>>> r.groupdict()
{}

# Run findall
>>> regex.findall(string)
[u'%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1']

where string is set to your sample data.

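However, as with what I showed for .NET, it would probably be better to parse the HTML with something like BeautifulSoup to get the value of the href, and then process that value.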
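As a rough sketch of that parse-first approach: BeautifulSoup is a third-party package, so the example below swaps in the standard library's `html.parser` to stay self-contained. `FirstHrefParser` is a hypothetical helper name, and the prefix check mirrors OPTION 2 from the .NET code. The final step decodes the percent-encoded data with `urllib.parse.unquote`.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class FirstHrefParser(HTMLParser):
    """Hypothetical helper: remembers the first href found on an <a> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


# Sample input from the question.
html = ('<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" '
        'title="ააგებს">ააგებს</a></li>')

parser = FirstHrefParser()
parser.feed(html)

# Keep whatever follows the required start of the href value.
prefix = "/wiki/"
encoded = None
if parser.href and parser.href.startswith(prefix):
    encoded = parser.href[len(prefix):]

print(encoded)
# Decode the percent-encoded UTF-8 bytes back into text.
print(unquote(encoded) if encoded else None)
```

This way the regex, if one is used at all, only has to deal with the href value rather than with the surrounding markup.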