I need a regular expression that, given this:
<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" title="ააგებს">ააგებს</a></li>
will match:
%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1
So far I have:
<li><a href="/wiki/%.*\d
But I don't know how to exclude the leading part from the result. Any ideas? I'm using Python.
Answer 0 (score: 1)
Not sure which regex flavor you need, so here is a best guess:
/href="\/wiki\/((?:%[a-f0-9]{2})+)"/ig
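Since the question mentions Python, here is a minimal sketch of how that pattern might be applied there. It assumes the `/…/ig` delimiters and flags are JavaScript-style notation: in Python the delimiters are dropped, `i` becomes `re.IGNORECASE`, and the slashes need no escaping. The variable names are illustrative only.

```python
import re

# Sample line from the question.
html = ('<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" '
        'title="ააგებს">ააგებს</a></li>')

# The outer parentheses capture only the percent-encoded part;
# the 'href="/wiki/' prefix is matched but stays outside group 1.
pattern = re.compile(r'href="/wiki/((?:%[a-f0-9]{2})+)"', re.IGNORECASE)

match = pattern.search(html)
captured = match.group(1) if match else None
print(captured)
```

Reading `group(1)` rather than `group(0)` is what leaves the beginning out of the result.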
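Answer 1 (score: 1)
If you are using a .NET language, you can do something more robust than relying on a regex alone to get the value. HtmlAgilityPack works well for parsing HTML, even when the HTML is somewhat malformed.
Here I have a function that tries to extract the href attribute of the first <a> element in a piece of HTML; the rest of the program then shows two ways to extract the part of the href that comes after "/wiki/":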
Option Infer On
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module Module1

    ''' <summary>
    ''' Get the value of the href attribute in the first anchor (<a>) element of (a fragment of) an HTML string.
    ''' </summary>
    ''' <param name="s">An HTML fragment.</param>
    ''' <returns>The value of the href attribute in the first anchor (<a>) element.</returns>
    ''' <remarks>Throws a FormatException if the href value cannot be found.</remarks>
    Function GetHref(s As String) As String
        ' Get the value of the href attribute, if it exists, in a reliable fashion. '
        Dim htmlDoc As New HtmlDocument
        htmlDoc.LoadHtml(s)
        Dim link = htmlDoc.DocumentNode.SelectSingleNode("//a")
        Dim hrefValue = String.Empty

        If link IsNot Nothing Then
            If link.Attributes("href") IsNot Nothing Then
                hrefValue = link.Attributes("href").Value
            Else
                ' there was no href '
                Throw New FormatException("No href attribute in the <a> element.")
            End If
        Else
            ' there was no <a> element '
            Throw New FormatException("No <a> element.")
        End If

        Return hrefValue
    End Function

    Sub Main()
        Dim s = "<li><a href=""/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1"" title=""ააგებს"">ააგებს</a></li>"
        Dim dataToCapture = String.Empty
        Dim hrefValue = GetHref(s)

        ' OPTION 1 - using RegEx
        ' Only get a specific pattern of characters
        Dim re = New Regex("^/wiki/((?:%[0-9A-F]{2})+)", RegexOptions.IgnoreCase)
        Dim m = re.Match(hrefValue)
        If m.Success Then
            dataToCapture = m.Groups(1).Value
            Console.WriteLine(dataToCapture)
        Else
            Console.WriteLine("Failed to match with RegEx.")
        End If

        ' OPTION 2 - looking at the string
        ' Just get whatever comes after the required start of the href value.
        Dim mustStartWith = "/wiki/"
        If hrefValue.StartsWith(mustStartWith) Then
            dataToCapture = hrefValue.Substring(mustStartWith.Length)
            Console.WriteLine(dataToCapture)
        Else
            Console.WriteLine("Nothing found with string operations.")
        End If

        ' the percent-encoded data could be decoded with System.Uri.UnescapeDataString(dataToCapture) '
        Console.ReadLine()
    End Sub

End Module
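In the regex, parentheses, i.e. ( ), denote a group to capture. However, we don't need to capture the individual %AA parts, so those groups carry the ?: modifier, which makes them non-capturing.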
(The spurious ' characters at the ends of some comments are only there to help the code get highlighted correctly.)
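To illustrate the difference the ?: modifier makes, here is a small Python sketch (Python being the asker's language, not the .NET code above). It runs the same pattern twice on the href value, once with a non-capturing inner group and once with a capturing one; the variable names are made up for the example.

```python
import re

href = "/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1"

# Inner group is non-capturing: only the whole percent-encoded run is reported.
non_capturing = re.match(r"/wiki/((?:%[0-9A-F]{2})+)", href, re.IGNORECASE)

# Inner group is capturing: a second group appears, holding only the
# last repetition that matched.
capturing = re.match(r"/wiki/((%[0-9A-F]{2})+)", href, re.IGNORECASE)

print(non_capturing.groups())
print(capturing.groups())
```

With the capturing variant the extra group is harmless but noisy, which is why the non-capturing form is used when only the whole run is wanted.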
Answer 2 (score: 0)
As you are using Python, you can try things out with something like the Python Regular Expression Testing Tool:
>>> regex = re.compile("href=\"/wiki/((?:%[0-9A-F]{2})+)\"",re.IGNORECASE)
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xd640db26af2f1d60>
>>> regex.match(string)
None
# List the groups found
>>> r.groups()
(u'%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1',)
# List the named dictionary objects found
>>> r.groupdict()
{}
# Run findall
>>> regex.findall(string)
[u'%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1']
where string is set to your sample data.
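However, similar to what I showed in .NET, it would probably be better to parse the HTML with something like BeautifulSoup to get the value of the href, and then process that value.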
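As a rough sketch of that parse-first approach, the example below uses the standard library's html.parser rather than BeautifulSoup, purely so that it runs without extra packages; the idea is the same. FirstHrefParser and wiki_title are hypothetical names invented for this illustration, and urllib.parse.unquote plays the role that System.Uri.UnescapeDataString has in the .NET code.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class FirstHrefParser(HTMLParser):
    """Hypothetical helper: remembers the first href seen on an <a> element."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Keep only the first href found; later <a> elements are ignored.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def wiki_title(fragment, prefix="/wiki/"):
    """Return the part of the href after the prefix, or None if absent."""
    parser = FirstHrefParser()
    parser.feed(fragment)
    parser.close()
    if parser.href is None or not parser.href.startswith(prefix):
        return None
    return parser.href[len(prefix):]


html = ('<li><a href="/wiki/%E1%83%90%E1%83%90%E1%83%92%E1%83%94%E1%83%91%E1%83%A1" '
        'title="ააგებს">ააგებს</a></li>')

encoded = wiki_title(html)
decoded = unquote(encoded)  # percent-decoding, UTF-8 by default
print(encoded)
print(decoded)
```

Once the href has been pulled out by a parser, a plain prefix check is enough and no regex is needed at all.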