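Regular expression to match anchor tags and their href

Date: 2017-05-05 23:05:02

Tags: regex swift nsregularexpression

I want to run a regular expression over an HTML string that contains several anchor tags and build a dictionary mapping each link's text to its href URL.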

<p>This is a simple text with some embedded <a href="http://example.com/link/to/some/page?param1=77&param2=22">links</a>. This is a <a href="https://exmp.le/sample-page/?uu=1">different link</a>.

How can I extract both the text and the href of each <a> tag in one pass?

Edit:

func extractLinks(html: String) -> Dictionary<String, String>? {

    do {
        let regex = try NSRegularExpression(pattern: "/<([a-z]*)\b[^>]*>(.*?)</\1>/i", options: [])
        let nsString = html as NSString
        let results = regex.matchesInString(html, options: [], range: NSMakeRange(0, nsString.length))
        return results.map { nsString.substringWithRange($0.range)}
    } catch let error as NSError {
        print("invalid regex: \(error.localizedDescription)")
        return nil
    }
}

1 Answer:

Answer 0 (score: 2)

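First of all, you need to learn the basic pattern syntax of NSRegularExpression:

  • The pattern does not contain delimiters.
  • The pattern does not contain modifiers; you need to pass such information as options.
  • When you want to use the metacharacter \, you need to escape it as \\ in a Swift String literal.

So the line that creates the NSRegularExpression instance should look something like this: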

let regex = try NSRegularExpression(pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>", options: .caseInsensitive)
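To see those three rules applied, here is a minimal sketch that runs the corrected pattern over a short string. The helper name `matchElements(in:)` and the sample input are assumptions made for illustration, and the sketch assumes Swift 4 or later for `Range(_:in:)`.

```swift
import Foundation

// Hypothetical helper: returns every substring matched by the corrected pattern.
func matchElements(in text: String) -> [String] {
    do {
        // No /.../ delimiters, no trailing "i" modifier, and doubled backslashes.
        let regex = try NSRegularExpression(pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>",
                                            options: .caseInsensitive)
        let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
        // Convert each NSRange back into a Swift range before slicing the string.
        return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
            Range(match.range, in: text).map { String(text[$0]) }
        }
    } catch {
        print("invalid regex: \(error.localizedDescription)")
        return []
    }
}

print(matchElements(in: "<b>bold</b> and <I>italic</I>"))
```

Each returned string is one complete element, from its opening tag to the matching closing tag. The uppercase `<I>` is matched only because `.caseInsensitive` is passed as an option.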

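But, as you may already know, your pattern contains nothing that matches href or captures its value.

Something like this would work for your example HTML: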

import Foundation

// Group 1: the quoted href value; group 2: the inner text up to the closing </a>.
let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let html = "<p>This is a simple text with some embedded <a\n" +
    "href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>.\n" +
    "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let matches = regex.matches(in: html, options: [], range: NSRange(0..<html.utf16.count))
var resultDict: [String: String] = [:]
for match in matches {
    // Skip the opening and closing quote characters captured in group 1.
    // range(at:) is the Swift 4+ spelling; Swift 3 used rangeAt(_:).
    let hrefRange = NSRange(location: match.range(at: 1).location+1, length: match.range(at: 1).length-2)
    let innerTextRange = match.range(at: 2)
    let href = (html as NSString).substring(with: hrefRange)
    let innerText = (html as NSString).substring(with: innerTextRange)
    resultDict[innerText] = href
}
print(resultDict)
//->["different link": "https://exmp.le/sample-page/?uu=1", "links": "http://example.com/link/to/some/page?param1=77&param2=22"] (dictionary order may vary)
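The same logic could be wrapped into the kind of function the question was aiming for. This is only a sketch: it assumes Swift 4 or later, it returns an empty dictionary rather than nil when nothing matches, and it keeps the last href when two links share the same text. None of these choices come from the original posts.

```swift
import Foundation

/// Maps each link's inner text to its href value (sketch, not a full HTML parser).
func extractLinks(html: String) -> [String: String] {
    let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
    // The pattern is a fixed literal, so a failure here would indicate a typo in it.
    guard let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive) else {
        return [:]
    }
    var links: [String: String] = [:]
    let fullRange = NSRange(html.startIndex..<html.endIndex, in: html)
    for match in regex.matches(in: html, options: [], range: fullRange) {
        guard let quotedRange = Range(match.range(at: 1), in: html),
              let textRange = Range(match.range(at: 2), in: html) else { continue }
        // Group 1 still includes the surrounding quotes; drop the first and last character.
        let href = String(html[quotedRange].dropFirst().dropLast())
        // Assumption: a later link with the same text overwrites an earlier one.
        links[String(html[textRange])] = href
    }
    return links
}
```

Calling `extractLinks(html:)` with the sample string from the question should give the same two entries as the loop above.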

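Keep in mind that my pattern above may mistakenly match malformed a-tags or miss some nested structures, and it does not handle HTML character entities...

If you want to make your code more robust and generic, you had better consider adopting an HTML parser, as suggested by ColGraff and Rob.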
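To illustrate the entity limitation: an href written as `page?a=1&amp;b=2` in the HTML source is captured verbatim, `&amp;` included. The helper below is a hypothetical stopgap that decodes only five common entities; its name and the chosen entity list are assumptions, and it is not a substitute for a real decoder.

```swift
import Foundation

/// Hypothetical helper: decodes only a handful of common HTML entities.
func decodeBasicEntities(_ text: String) -> String {
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" rather than "<".
    let replacements: [(String, String)] = [
        ("&lt;", "<"), ("&gt;", ">"), ("&quot;", "\""), ("&#39;", "'"), ("&amp;", "&")
    ]
    var result = text
    for (entity, character) in replacements {
        result = result.replacingOccurrences(of: entity, with: character)
    }
    return result
}
```

Numeric references other than `&#39;` and all remaining named entities pass through unchanged.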