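I want to run a regular expression over an HTML string that contains multiple anchor tags, and build a dictionary mapping each link's text to its href URL.
<p>This is a simple text with some embedded <a href="http://example.com/link/to/some/page?param1=77&param2=22">links</a>.
This is a <a href="https://exmp.le/sample-page/?uu=1">different link</a>.
How can I extract both the text and the href of each <a> tag in one pass?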
Edit:
func extractLinks(html: String) -> Dictionary<String, String>? {
do {
let regex = try NSRegularExpression(pattern: "/<([a-z]*)\b[^>]*>(.*?)</\1>/i", options: [])
let nsString = html as NSString
let results = regex.matchesInString(html, options: [], range: NSMakeRange(0, nsString.length))
return results.map { nsString.substringWithRange($0.range)}
} catch let error as NSError {
print("invalid regex: \(error.localizedDescription)")
return nil
}
}
Answer 0 (score: 2)
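First of all, you need to learn the basic syntax of the pattern argument of NSRegularExpression:
The pattern does not contain delimiters.
The pattern does not contain modifiers; you need to pass that information as options.
When you want to use the metacharacter \, you need to escape it as \\ in a Swift String literal.
So the line creating the NSRegularExpression instance should look something like this: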
let regex = try NSRegularExpression(pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>", options: .caseInsensitive)
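As a quick check of the delimiter point, here is a minimal sketch comparing the two patterns on a tiny input. The test string and the constant names are made up for illustration; the only API used is NSRegularExpression's numberOfMatches(in:options:range:).

```swift
import Foundation

// Hypothetical test input, not taken from the question.
let snippet = "<b>bold</b>"
let snippetRange = NSRange(location: 0, length: (snippet as NSString).length)

// With "/" delimiters and a trailing "i", those characters are matched literally,
// so the pattern requires a "/" directly before "<" and finds nothing here.
let withDelimiters = try! NSRegularExpression(
    pattern: "/<([a-z]*)\\b[^>]*>(.*?)</\\1>/i", options: [])
let delimiterMatches = withDelimiters.numberOfMatches(
    in: snippet, options: [], range: snippetRange)

// Without delimiters, and with case-insensitivity passed via options.
let corrected = try! NSRegularExpression(
    pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>", options: .caseInsensitive)
let correctedMatches = corrected.numberOfMatches(
    in: snippet, options: [], range: snippetRange)

print(delimiterMatches, correctedMatches)
```

The first pattern compiles without error, which is why this kind of mistake is easy to miss: it simply never matches.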
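However, as you may already know, your pattern does not contain anything that matches href or captures its value.
Something like this would work with your example html: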
let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let html = "<p>This is a simple text with some embedded <a\n" +
"href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>.\n" +
"This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let matches = regex.matches(in: html, options: [], range: NSRange(0..<html.utf16.count))
var resultDict: [String: String] = [:]
for match in matches {
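// Capture group 1 includes the surrounding quotes, so trim one character from each end.
// (range(at:) is the Swift 4+ spelling; in Swift 3 it was rangeAt(_:).)
let hrefRange = NSRange(location: match.range(at: 1).location+1, length: match.range(at: 1).length-2)
let innerTextRange = match.range(at: 2)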
let href = (html as NSString).substring(with: hrefRange)
let innerText = (html as NSString).substring(with: innerTextRange)
resultDict[innerText] = href
}
print(resultDict)
//->["different link": "https://exmp.le/sample-page/?uu=1", "links": "http://example.com/link/to/some/page?param1=77&param2=22"]
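If you want this in the shape of the function from the question, a minimal sketch could look like the following. The name extractLinks(from:) and the choice to return an empty dictionary on failure are my own assumptions; it uses the Swift 4+ range(at:) spelling, and duplicate link texts overwrite each other because the text is the dictionary key.

```swift
import Foundation

/// Hypothetical helper: maps each link's inner text to its href value.
func extractLinks(from html: String) -> [String: String] {
    // Group 1: the quoted href value; group 2: the text between <a ...> and </a>.
    let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
    guard let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive) else {
        return [:]
    }
    let nsHTML = html as NSString
    let fullRange = NSRange(location: 0, length: nsHTML.length)
    var links: [String: String] = [:]
    for match in regex.matches(in: html, options: [], range: fullRange) {
        let quoted = match.range(at: 1)
        // Drop the surrounding quote characters from the href value.
        let hrefRange = NSRange(location: quoted.location + 1, length: quoted.length - 2)
        let href = nsHTML.substring(with: hrefRange)
        let text = nsHTML.substring(with: match.range(at: 2))
        links[text] = href
    }
    return links
}

let sample = "<p>This is a simple text with some embedded " +
    "<a href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>.\n" +
    "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let links = extractLinks(from: sample)
print(links)
```

Returning a non-optional dictionary keeps the call site simple; if you need to distinguish "invalid pattern" from "no links found", make the function throwing instead.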
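Keep in mind that my pattern above may wrongly match ill-formed a-tags or miss some nested structures, and it has no handling for HTML character entities...
If you want to make your code more robust and generic, you'd better consider adopting an HTML parser, as suggested by ColGraff and Rob.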