Cleaning up text strings in Swift

Asked: 2018-04-17 18:28:37

Tags: swift string data-cleaning

I want to use some rather messy text in my app. I have no control over the text, so it is what it is.

I'm looking for a lightweight [1] way to clean up everything shown in the examples here:

original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode

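So we see special characters such as &nbsp;, Unicode escapes such as \u00f1, HTML paragraph tags such as <p></p>, line-ending sequences such as \r\n, and stray backslashes \ in odd places. What's needed is to translate whatever can be translated and remove the rest of the junk.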

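I could manipulate the strings directly and handle each of these cases individually, but I'm wondering whether there is a simple [1] way to clean these strings without much overhead [1].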

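A partial answer has already been offered, but the examples I've provided contain more problems than it solves. That solution converts HTML special characters, but it does not handle Unicode escapes formatted as \u0000, does not remove HTML tags, and so on.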

Other things I've tried

This isn't the general solution I'm after, but it shows the direction I'm heading in.

import Foundation

let samples = ["<p>This is test1</p>                                             ":"This is test1",
           "<p>This is u\\u00f1icode</p>                                      ":"This is uñicode",
           "<p>This is u&#x00f1;icode</p>                                       ":"This is uñicode",
           "<p>This is junk, but it's what I have<\\/p>\\r\\n                   ":"This is junk, but it's what I have",
           "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>":"Sometimes they emphasize like this, I could live with it",
           "<p>Occasionally we&nbsp;deal&nbsp;with this.</p>                 ":"Occasionally we deal with this."]

for (key, value) in samples {
    print ("original: \(key)      desired: \(value)" )
}

print("\n\n\n")

for (key, _) in samples {
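    // drop the padding used to align the sample keys
    var _key = key.trimmingCharacters(in: CharacterSet.whitespaces)
    // turn the escaped "\/" into "/" so that "<\/p>" becomes "</p>"
    _key = _key.replacingOccurrences(of: "\\/", with: "/")

    // strip a literal trailing "\r\n" and the surrounding paragraph tags
    if _key.hasSuffix("\\r\\n") { _key = String(_key.dropLast(4)) }
    if _key.hasPrefix("<p>") { _key = String(_key.dropFirst(3)) }
    if _key.hasSuffix("</p>") { _key = String(_key.dropLast(4)) }

    // rewrite each "\uXXXX" as the HTML entity "&#xXXXX;"; stop if fewer than
    // four characters follow the "\u" instead of indexing past the end
    while let uniRange = _key.range(of: "\\u"),
          let charDefEnd = _key.index(uniRange.upperBound, offsetBy: 4, limitedBy: _key.endIndex) {
        let charDefRange = uniRange.upperBound..<charDefEnd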
        let uniFullRange = uniRange.lowerBound..<charDefRange.upperBound
        let charDef = "&#x" + _key[charDefRange] + ";"

        _key = _key.replacingCharacters(in: uniFullRange, with: charDef)
    }

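    // stringByDecodingHTMLEntities is not part of Foundation; it is presumably the
    // String extension from the partial answer mentioned above
    let decoded = _key.stringByDecodingHTMLEntities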
    print("decoded: \(decoded)")
}

Output

original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is u&#x00f1;icode</p>                                          desired: This is uñicode
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode




decoded: Occasionally we deal with this.
decoded: Sometimes they \emphasize\ like this, I could live with it
decoded: This is uñicode
decoded: This is junk, but it's what I have
decoded: This is test1
decoded: This is uñicode
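The second decoded line still contains the stray backslashes. A minimal sketch of one possible final pass, assuming that every backslash still present at that point is junk (that is, \uXXXX, \r, \n and \/ have already been dealt with). The helper name `removingStrayBackslashes` is made up for illustration:

```swift
import Foundation

// Hypothetical helper: removes every backslash that is still in the text.
// Assumes all meaningful escapes were translated by earlier steps.
func removingStrayBackslashes(_ text: String) -> String {
    return text.replacingOccurrences(of: "\\", with: "")
}

let leftover = "Sometimes they \\emphasize\\ like this, I could live with it"
print(removingStrayBackslashes(leftover))
// prints: Sometimes they emphasize like this, I could live with it
```

Running this step before the escapes are translated would destroy them, so it only makes sense as the last step.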

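Footnote: 1. There are probably many larger packages or libraries that can do this as a small part of their overall functionality; those are of less interest here.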

1 answer:

Answer 0 (score: 1)

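I couldn't make sense of the stray backslashes, but to remove the HTML tags and translate the HTML entities and escape sequences, you can perform the following replacements using regular expressions: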

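Note that you need a dictionary of HTML entities, otherwise this won't work. The number of escape sequences is small, and building a complete dictionary for them is not complicated.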

import Foundation

let strings = [
    "<p>Occasionally we&nbsp;deal&nbsp;with this.</p> ",
    "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>",
    "<p>This is junk, but it's what I have<\\/p>\\r\\n",
    "<p>This is test1</p>",
    "<p>This is u\\u00f1icode</p>",
]

// the pattern needs exactly one capture group
func replaceEntities(in text: String, pattern: String, replace: (String) -> String?) -> String {
    let buffer = (text as NSString).mutableCopy() as! NSMutableString
    let regularExpression = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)

    let matches = regularExpression.matches(in: text, options: [], range: NSRange(location: 0, length: buffer.length))

    // need to replace from the end or the ranges will break after first replacement
    for match in matches.reversed() {
        let captureGroupRange = match.range(at: 1)
        let matchedEntity = buffer.substring(with: captureGroupRange)
        guard let replacement = replace(matchedEntity) else {
            continue
        }
        buffer.replaceCharacters(in: match.range, with: replacement)
    }

    return buffer as String
}

let htmlEntities = [
    "nbsp": "\u{00A0}"
]

func replaceHtmlEntities(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "&([^;]+);") {
        return htmlEntities[$0]
    }
}

let escapeSequences = [
    "n": "\n",
    "r": "\r"
]

func replaceEscapes(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "\\\\([a-z])") {
        return escapeSequences[$0]
    }
}

func removeTags(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
}

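func replaceUnicodeSequences(_ text: String) -> String {
    // match exactly four hex digits, so the conversion never sees non-hex input
    return replaceEntities(in: text, pattern: "\\\\u([0-9a-f]{4})") { hex in
        // values that are not valid Unicode scalars (e.g. surrogates) are left untouched
        guard let value = Int(hex, radix: 16), let scalar = Unicode.Scalar(value) else {
            return nil
        }
        return String(scalar)
    }
}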

let purifiedStrings = strings
    .map(removeTags)
    .map(replaceHtmlEntities)
    .map(replaceEscapes)
    .map(replaceUnicodeSequences)

print(purifiedStrings.joined(separator: "\n"))

You could also trim leading and trailing whitespace and replace runs of multiple spaces with a single space, but that is trivial.
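A rough sketch of that whitespace step, for completeness. The function name `normalizingWhitespace` is invented here. It splits on Foundation's `whitespacesAndNewlines` character set, which also covers the carriage returns, line feeds and non-breaking spaces produced by the earlier replacements, and joins the remaining pieces with single spaces:

```swift
import Foundation

// Hypothetical helper: collapses every run of whitespace (spaces, tabs,
// \r, \n, non-breaking spaces) into one space and drops leading and
// trailing whitespace.
func normalizingWhitespace(_ text: String) -> String {
    return text
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }
        .joined(separator: " ")
}

print(normalizingWhitespace("  This is junk,   but it's what I have\r\n"))
// prints: This is junk, but it's what I have
```

One side effect to be aware of: non-breaking spaces become ordinary spaces, which matches the desired output in the question but may not suit every use.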

You can combine this with the solution from "How do I decode HTML entities in swift?"
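For illustration only, here is a small sketch of what entity decoding involves. It is not the implementation from the linked question. The names `namedEntities`, `decodeEntity` and `decodingHTMLEntities` are made up, and the named-entity table is deliberately tiny. It handles hexadecimal (`&#x00f1;`) and decimal (`&#241;`) references plus a handful of names, and leaves anything it does not recognize untouched:

```swift
import Foundation

// Deliberately small, illustrative table; a complete one has many more names.
let namedEntities: [String: String] = [
    "nbsp": "\u{00A0}", "amp": "&", "lt": "<", "gt": ">", "quot": "\""
]

// Hypothetical helper: converts the text between "&" and ";" to a string,
// or returns nil when the entity is unknown or not a valid code point.
func decodeEntity(_ name: String) -> String? {
    let digits: Substring
    let radix: Int
    if name.hasPrefix("#x") || name.hasPrefix("#X") {
        digits = name.dropFirst(2)
        radix = 16
    } else if name.hasPrefix("#") {
        digits = name.dropFirst(1)
        radix = 10
    } else {
        return namedEntities[name]
    }
    guard let value = UInt32(digits, radix: radix),
          let scalar = Unicode.Scalar(value) else { return nil }
    return String(Character(scalar))
}

func decodingHTMLEntities(_ text: String) -> String {
    let regex = try! NSRegularExpression(pattern: "&(#?[0-9a-z]+);", options: .caseInsensitive)
    let matches = regex.matches(in: text, options: [], range: NSRange(text.startIndex..., in: text))
    var result = text
    // Replace from the end so the ranges of earlier matches stay valid.
    for match in matches.reversed() {
        guard let whole = Range(match.range, in: result),
              let inner = Range(match.range(at: 1), in: result),
              let replacement = decodeEntity(String(result[inner])) else { continue }
        result.replaceSubrange(whole, with: replacement)
    }
    return result
}

print(decodingHTMLEntities("<p>This is u&#x00f1;icode</p>"))
// prints: <p>This is uñicode</p>
```

With numeric references handled this way, the \uXXXX escapes can either be rewritten to `&#xXXXX;` first, as the question does, or replaced directly, as in the answer above.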